ETH Zurich - Web Scale Data Processing and Mining Project

This is the main repository for the web scale data mining project, which took place in summer 2014 as a research project.

Directory Structure and Overview

└── src - the source code projects, see below
    ├── WSDA
    ├── combine_sequence_files
    ├── examples
    │   ├── spark_example
    │   └── word_count_1
    ├── html_to_text_conversion
    ├── remove_infrequent_words
    ├── results_display
    ├── scripts
    └── word_count

Repositories

This is the code repository
The runs and the raw results can be found in this repository
The hadoop config is here
The spark config is here

Project Management, Documentation

Google Drive

Source code projects

WSDA

A self-implemented LDA (Latent Dirichlet Allocation) topic model.

@hany-abdelrahman: the WSDA directory should probably be renamed to something more meaningful 😉 TODO: add some more doc, references, etc.

Author: Hany Abdelrahman

combine_sequence_files

Combines the sequence files found in each subdirectory into one sequence file per subdirectory. Each combined file has the same name as the subdirectory it was built from.

This makes it possible to create a flat directory structure with a few large sequence files.
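The grouping described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not the project's implementation: it plans which input files would go into which output file and does not read or write Hadoop sequence files. The function name `plan_combination` and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_combination(paths, root):
    """Group input file paths by their top-level subdirectory under root.

    Returns a dict mapping the output file name (the subdirectory name)
    to the sorted list of input files that would be merged into it.
    """
    groups = defaultdict(list)
    for path in paths:
        relative = PurePosixPath(path).relative_to(root)
        if len(relative.parts) < 2:
            # File sits directly under root, so it has no subdirectory.
            continue
        groups[relative.parts[0]].append(path)
    return {name: sorted(files) for name, files in groups.items()}


if __name__ == "__main__":
    # Hypothetical input layout, for illustration only.
    plan = plan_combination(
        ["/in/a/part-1", "/in/a/part-0", "/in/b/part-0"], "/in"
    )
    for output_name, inputs in plan.items():
        print(output_name, "<-", inputs)
```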

Author: Lukas Elmer

examples

Contains a Spark example project and a simple word count application. These are intended only for setting up the development environment.

Author: Lukas Elmer

html_to_text_conversion

Converts web archive records into sequence files. It removes all HTML and JS tags using boilerplate and applies the following additional steps:

remove stopwords
remove words containing characters other than a-z
try to remove non-English documents
remove numbers
remove URLs
convert uppercase characters to lowercase
apply stemming (org.apache.lucene.analysis.en.EnglishAnalyzer)
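
Some of the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: it assumes the HTML has already been stripped, uses a tiny hypothetical stopword list, and leaves out language detection and stemming (the project uses Lucene's `EnglishAnalyzer` for the latter). The names `clean_text`, `STOPWORDS`, and `URL_PATTERN` are made up for this example.

```python
import re

# Hypothetical, very small stopword list used only for illustration.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "at"}

# Matches tokens that start like a URL, up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
ALPHA_ONLY = re.compile(r"[a-z]+")


def clean_text(text, stopwords=STOPWORDS):
    """Return the cleaned tokens of a text that is already free of markup."""
    text = text.lower()                # convert uppercase to lowercase
    text = URL_PATTERN.sub(" ", text)  # remove URLs
    tokens = []
    for token in text.split():
        # Drop punctuation surrounding the token.
        token = token.strip(".,;:!?\"'()[]{}")
        # Keep only tokens made of a-z; this also drops numbers.
        if not ALPHA_ONLY.fullmatch(token):
            continue
        if token in stopwords:
            continue
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    sample = "The Spark cluster at https://example.org processed 42 files!"
    print(clean_text(sample))
```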

remove_infrequent_words

Removes words that appear infrequently. Requires a word count dictionary as input.
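
The idea can be sketched as a simple filter against the word count dictionary. This is an assumption-laden illustration rather than the project's implementation: the function name and the `min_count` threshold are hypothetical, and the real job operates on sequence files rather than in-memory lists.

```python
def remove_infrequent_words(tokens, word_counts, min_count=2):
    """Keep only the tokens whose corpus-wide count is at least min_count.

    word_counts maps each word to its total count; words missing from
    the dictionary are treated as having a count of zero.
    """
    return [t for t in tokens if word_counts.get(t, 0) >= min_count]


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    counts = {"topic": 12, "model": 7, "xyzzy": 1}
    print(remove_infrequent_words(["topic", "xyzzy", "model", "unseen"], counts))
```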

Example of how to use it

Author: Lukas Elmer

results_display

A script that helps display the topics. It generates:

A readable text version
A tag cloud for each topic, with each word sized in proportion to its probability
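
One possible way to weight word sizes by probability is a linear mapping onto a font-size range. The sketch below is an assumption, not the script's actual scaling: the function name `font_sizes` and the `min_size` and `max_size` parameters are hypothetical.

```python
def font_sizes(word_probs, min_size=10, max_size=48):
    """Map each word's probability linearly onto [min_size, max_size]."""
    if not word_probs:
        return {}
    lowest = min(word_probs.values())
    highest = max(word_probs.values())
    span = highest - lowest
    sizes = {}
    for word, prob in word_probs.items():
        # If all probabilities are equal, give every word the largest size.
        scale = (prob - lowest) / span if span > 0 else 1.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes


if __name__ == "__main__":
    # Hypothetical topic, for illustration only.
    print(font_sizes({"data": 0.5, "web": 0.3, "mine": 0.1}))
```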

Author: Lukas Elmer

word_count

Simple word count for sequence files.
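
Conceptually, the count is a sum of per-document token counts. The following in-memory sketch only shows that idea: it assumes documents are whitespace-separated strings, whereas the project's job runs on sequence files. The function name `count_words` is hypothetical.

```python
from collections import Counter


def count_words(documents):
    """Count whitespace-separated tokens across an iterable of texts."""
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    return counts


if __name__ == "__main__":
    # Hypothetical documents, for illustration only.
    print(count_words(["topic model topic", "model data"]))
```

A dictionary produced this way has the shape that remove_infrequent_words expects as input.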

Example of how to use it

Author: Lukas Elmer
