Clickstream-Hadoop

Predicting Website Navigation with MapReduce in Hadoop

Presentation can be found here.

Dataset

This project uses the English Wikipedia clickstream dataset, which can be found here.

Statistics Generated

  • user browsing behavior - top 10 breadcrumb trails that most users follow (Ryan)
  • most popular - top 10 most visited Wikipedia pages (Qizhou)
  • external, link, and other - which Wikipedia page is most often linked to externally, which page is most often linked to from within Wikipedia, and which page is most often reached from other sources (referrer missing) (Parth)

NextClick

Description

This MapReduce module creates a TSV file with the most likely next click for any link on Wikipedia. This file can be used in other scripts to predict user browsing behavior.
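
As a rough illustration of the idea, the sketch below mimics the map and reduce steps in plain Python rather than Hadoop. It assumes the four-column clickstream layout (prev, curr, type, n) and assumes that only rows of type `link` count as a click from one page to another. The function name `most_likely_next_click`, the filtering rule, and the sample rows are all made up for illustration and are not taken from the module's Java source.

```python
from collections import defaultdict


def most_likely_next_click(lines):
    """Return {prev_page: (next_page, clicks)} for the most-clicked link from each page."""
    # "Map" step: group (curr, n) pairs under their referring page.
    grouped = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows
        prev, curr, link_type, n = fields
        if link_type != "link":  # assumption: ignore external/other referrers
            continue
        grouped[prev].append((curr, int(n)))

    # "Reduce" step: for each referring page keep the target with the highest count.
    return {prev: max(targets, key=lambda t: t[1]) for prev, targets in grouped.items()}


# Hypothetical sample rows, not real clickstream figures.
sample = [
    "Hadoop\tMapReduce\tlink\t120",
    "Hadoop\tApache_Spark\tlink\t80",
    "other-search\tHadoop\texternal\t900",
]
for page, (next_page, clicks) in most_likely_next_click(sample).items():
    print(f"{page}\t{next_page}\t{clicks}")
```

The actual module may break ties or filter rows differently; the sketch only shows the group-then-pick-the-maximum shape of the computation.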

Compiling / Running

Make sure your required files are in HDFS and the output directory is empty. NextClick writes to output/next_click by default and takes the input TSV file path as the first argument.

hadoop fs -mkdir -p src/main/resources
hadoop fs -copyFromLocal src/main/resources
hadoop fs -rm -r output/next_click
mvn clean install
hadoop jar target/click-stream-0.0.1.jar hadoop.NextClick src/main/resources/clickstream-enwiki-2018-10.tsv.gz

TypeCount

Description

This MapReduce module counts how many pages are landed on for each referrer type: external, link (within Wikipedia), or other (unknown).
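
The aggregation can be pictured with a small plain-Python sketch, again outside Hadoop. It assumes the four-column clickstream layout (prev, curr, type, n) and assumes the count for each type is the sum of the `n` column; the module could instead count rows or distinct pages. The name `count_by_type` is hypothetical.

```python
from collections import Counter


def count_by_type(lines):
    """Sum the click counts for each referrer type (link, external, other)."""
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows
        _prev, _curr, link_type, n = fields
        # Assumption: a "landing" is weighted by the n column of the row.
        totals[link_type] += int(n)
    return dict(totals)


# Hypothetical sample rows, not real clickstream figures.
sample = [
    "Hadoop\tMapReduce\tlink\t120",
    "Hadoop\tApache_Spark\tlink\t80",
    "other-search\tHadoop\texternal\t900",
]
print(count_by_type(sample))
```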

Compiling / Running

Make sure your required files are in HDFS and the output directory is empty. TypeCount writes to output/type_count by default and takes the input TSV file path as the first argument.

hadoop fs -mkdir -p src/main/resources
hadoop fs -copyFromLocal src/main/resources
hadoop fs -rm -r output/type_count
mvn clean install
hadoop jar target/click-stream-0.0.1.jar hadoop.TypeCount src/main/resources/clickstream-enwiki-2018-10.tsv.gz

TopTen

Description

This MapReduce module computes the top ten most popular links for each of the three types: link (within Wikipedia), other (unknown referrer), and external (outside Wikipedia). It runs in two parts. The first part sums the counts for every referred-to link, aggregated by type and name. The second part uses those totals to produce the top ten lists.
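
The two parts described above can be sketched in plain Python as two functions, one per pass. This is a local simulation under stated assumptions, not the module's Java code: it assumes the four-column clickstream layout (prev, curr, type, n), and the names `sum_by_type_and_page` and `top_n_per_type` are hypothetical.

```python
import heapq
from collections import defaultdict


def sum_by_type_and_page(lines):
    """Part 1: total clicks for every (type, page) pair."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows
        _prev, curr, link_type, n = fields
        totals[(link_type, curr)] += int(n)
    return totals


def top_n_per_type(totals, n=10):
    """Part 2: the n pages with the highest totals within each type."""
    by_type = defaultdict(list)
    for (link_type, page), clicks in totals.items():
        by_type[link_type].append((clicks, page))
    # nlargest returns (clicks, page) pairs sorted from highest to lowest.
    return {t: heapq.nlargest(n, pairs) for t, pairs in by_type.items()}


# Hypothetical sample rows, not real clickstream figures.
sample = [
    "Hadoop\tMapReduce\tlink\t120",
    "Apache_Spark\tMapReduce\tlink\t30",
    "Hadoop\tApache_Spark\tlink\t80",
    "other-search\tHadoop\texternal\t900",
]
print(top_n_per_type(sum_by_type_and_page(sample)))
```

Splitting the work this way mirrors the two output directories: the first pass shrinks the data to one total per type and page, so the second pass only has to rank those totals.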

Compiling / Running

Make sure your required files are in HDFS and the output directories are empty. TopTen writes to output/top_ten_part1 and output/top_ten_part2 by default and takes the input TSV file path as the first argument.

hadoop fs -mkdir -p src/main/resources
hadoop fs -copyFromLocal src/main/resources
hadoop fs -rm -r output/top_ten*
mvn clean install
hadoop jar target/click-stream-0.0.1.jar hadoop.TopTen src/main/resources/clickstream-enwiki-2018-10.tsv.gz

Contributors

falconpd, pvp51, rxt1077