Waterloo Spam Rankings for the ClueWeb12 Dataset

This repository provides the code needed to extract the spam scores for the ClueWeb12 (cw12) dataset using spam models developed by Cormack, Smucker, and Clarke for ClueWeb09.

As part of a release of this repository, we include a large tar file, waterloo-spam-cw12-encoded.tar, that contains one gzip file for each of the cw12 directories. Each file was encoded using compress-spam12.c before being gzip'd. After gunzipping, each file must be decoded using decompress-spam12.c. To fetch and decode all of the files, run the following (assuming a Linux-like setup and a bash shell):

  wget https://github.com/UWaterlooIR/spam-rankings/releases/download/v1.0/waterloo-spam-cw12-encoded.tar
  wget https://raw.githubusercontent.com/UWaterlooIR/spam-rankings/main/decompress-spam12.c
  gcc -o decompress-spam12 decompress-spam12.c
  mkdir waterloo-spam-cw12-decoded  
  tar -xvf waterloo-spam-cw12-encoded.tar
  cd waterloo-spam-cw12-encoded
  for f in *.spamPct.gz ; do gunzip -c "$f" | ../decompress-spam12 | gzip -c > "../waterloo-spam-cw12-decoded/$f" ; done

The tar is 654 MB. Decoded, but still gzip'd, the files are 2.6 GB.

The format of each decoded file is:

    percentile-score clueweb-docid

where the percentile score indicates the percentage of the documents in the corpus that are "spammier" as per the "fusion" spam score. The spammiest documents have a score of 0, and the least spammy have a score of 99. We have not extensively tested the spam scores on cw12, but they appear reasonable.

The docids are not listed in any particular order in each file.
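
Because the docids are unordered, one simple way to use a decoded file is to load it into a lookup table keyed by docid and then filter on the percentile. The following is a minimal sketch rather than part of this repository: the function names and the threshold are hypothetical, and it assumes that the percentile is an integer and that the two fields are separated by whitespace.

```python
import gzip


def read_spam_scores(lines):
    """Yield (percentile, docid) pairs from 'percentile-score clueweb-docid' lines.

    Assumes whitespace-separated fields and an integer percentile.
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        yield int(parts[0]), parts[1]


def load_scores(path):
    """Load one decoded, still gzip'd file into a docid -> percentile dict."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return {docid: pct for pct, docid in read_spam_scores(handle)}


def keep_docids(scores, min_percentile):
    """Return the docids whose percentile is at least min_percentile.

    Higher percentiles are less spammy, so this drops the spammiest documents.
    """
    return {docid for docid, pct in scores.items() if pct >= min_percentile}
```

For example, `keep_docids(load_scores(path), 70)` would keep only the documents at or above the 70th percentile in the file at `path`. The cutoff of 70 is only an illustration; choose a threshold that suits your task.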

The fusion spam score is the average of the scores produced by the three models described in "Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets", with the modification that the "Britney" model has been trained on a data set that is very similar to, but slightly different from, the one used for the ClueWeb09 model.
