

Chuweb21D: A Deduped English Document Collection for Web Search Tasks

This is the home page for the public data collections Chuweb21 and Chuweb21D. For more details on these two collections, refer to the paper Chuweb21D: A Deduped English Document Collection for Web Search Tasks (to be released soon).

Background

Chuweb21D is an English document collection deduped from Chuweb21, which was used as the target corpus for the NTCIR-16 WWW-4 task. Our motivation for deduping Chuweb21 comes from our experience as organisers of that NTCIR task: when conducting relevance assessments of the pooled documents, we witnessed a considerable amount of duplicate and near-duplicate web pages, which can potentially cause problems. For deduping, we employ the Simhash strategy with two different clustering thresholds (Hamming distance $\tau \le 2$ and $\tau \le 3$) and release two versions of Chuweb21D; the smaller collection (Chuweb21D-60) will be used as the target corpus for the upcoming NTCIR-17 FairWeb-1 (Group-Fair Web Search) task.
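To illustrate the idea behind Simhash-based near-duplicate detection, here is a minimal sketch. It is not the pipeline used to build Chuweb21D: the tokenizer, the 64-bit fingerprint width, the MD5-based token hash, and the function names (simhash, hamming_distance, is_near_duplicate) are all assumptions made for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a Simhash fingerprint built from lowercase word tokens.

    Assumption: plain word tokens with equal weights; a real pipeline may
    use shingles or weighted features instead.
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Map each token to a stable 64-bit integer (MD5 is used only for mixing).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the tokens voted positively overall.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(doc_a: str, doc_b: str, tau: int = 3) -> bool:
    """Treat two documents as near-duplicates if their distance is at most tau."""
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= tau
```

With this sketch, a stricter threshold (tau=2) flags fewer pairs as near-duplicates than a looser one (tau=3), so the looser threshold removes more documents.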

The following figure summarizes the construction procedures of the Chuweb21 and Chuweb21D collections:

[Figure: construction procedures]

Basic statistics

The following table compares several popular English document collections with our Chuweb21(D) datasets.

| Name | #Docs | Space | Collected time | Format |
| TREC GOV2 | 25M | 80GB | Early 2004 | raw html |
| MS MARCO Docs | 3.2M | 22GB | Before Jan 2017 | text (title; body) |
| Tensorflow c4 | / | 750GB | Apr 2019 | text |
| ClueWeb09 | 1.04B | 5TB | Jan 2009~Feb 2009 | raw html |
| ClueWeb12 | 733M | 5.54TB | Feb 2012~May 2012 | raw html |
| ClueWeb22 | 10B | / | Before Aug 2022 | raw html |
| Chuweb21 | 82.5M | 1.7TB | Apr 2021 | raw html |
| Chuweb21D-60 | 49.8M | 1.2TB | Apr 2021 | raw html |
| Chuweb21D-70 | 57.9M | 964GB | Apr 2021 | raw html |
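As a quick sanity check on the document counts above, the following sketch computes the share of Chuweb21 documents that remains in each deduped version. The figures are taken from the table; the variable and function names are illustrative. The results come out near 60% and 70%, which seems consistent with the "-60" and "-70" suffixes, although this section does not state how the names were chosen.

```python
# Document counts in millions, copied from the table above.
docs_millions = {"Chuweb21": 82.5, "Chuweb21D-60": 49.8, "Chuweb21D-70": 57.9}


def retained_percent(name: str) -> float:
    """Share of Chuweb21 documents kept in the given collection, in percent."""
    return round(100 * docs_millions[name] / docs_millions["Chuweb21"], 1)


for name in ("Chuweb21D-60", "Chuweb21D-70"):
    print(f"{name}: {retained_percent(name)}% of Chuweb21 documents retained")
```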

How to download

We have already uploaded Chuweb21 and Chuweb21D-60 to the web disk (Chuweb21D-70 is coming soon!). You can download the data through the shared links listed below:

| Dataset | Link (Chinese mainland) | Link (others) | MD5 code |
| Chuweb21 | Baidu Cloud (code: t828) | Google Drive | md5.txt |
| Chuweb21D-60 | Baidu Cloud (code: a6j2) | TeraBox (code: wtsh) | md5.txt |
| Chuweb21D-70 | Baidu Cloud (code: v6xh) | Part1: TeraBox (code: 75im); Part2: TeraBox (code: spq8) | md5.txt |

In addition to web disk download, we also support both hard disk shipping (only for users in Chinese mainland) and server SCP for data delivery. You can contact us via e-mail ([email protected]) if needed.

Dataset structure

Both Chuweb21 and Chuweb21D collections are organized as below:

data
├── CC-MAIN-XXX     // eight folders (named by time interval), each containing 640 (or 639) warc.gz files
│   ├── CC-MAIN-XXX-00000.warc.gz     // warc.gz file that packs plenty of HTML documents
│   ├── ... ...
│   └── CC-MAIN-XXX-00639.warc.gz
└── md5_checksum.txt     // MD5 checksum of each warc.gz file

8 directories, 5119 files
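After downloading, it may be worth verifying each warc.gz file against md5_checksum.txt. The sketch below is one way to do that with the standard library. The line format of md5_checksum.txt is not documented here, so the code assumes the common md5sum style ("<md5 hex>  <file name>" per line) and assumes file names are unique across folders; adjust load_checksums if the actual file differs. All function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_checksums(checksum_file):
    """Parse the checksum list (assumed format: '<md5 hex>  <file name>')."""
    expected = {}
    for line in Path(checksum_file).read_text().splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # md5sum marks binary mode with a leading '*' on the file name.
            name = Path(parts[-1].lstrip("*")).name
            expected[name] = parts[0].lower()
    return expected


def verify(data_dir, checksum_file):
    """Return the names of warc.gz files whose MD5 is missing or mismatched."""
    expected = load_checksums(checksum_file)
    failures = []
    for path in sorted(Path(data_dir).rglob("*.warc.gz")):
        if expected.get(path.name) != md5_of_file(path):
            failures.append(path.name)
    return failures


# Example (paths follow the layout shown above):
# bad = verify("data", "data/md5_checksum.txt")
# print("all files OK" if not bad else f"failed: {bad}")
```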

The HTML documents are stored in the "warc.gz" format (a sample file of about 70MB: sample.warc.gz). Here we provide a sample Python script to read a "warc.gz" file:

import traceback

import warc  # pip install warc3-wet

with warc.open("sample.warc.gz") as f:
    for record in f:
        try:
            url = record['WARC-Target-URI']  # URL of the HTML page
            uid = record['WARC-Record-ID']
            uid = uid.replace("<urn:uuid:", "").replace(">", "")  # HTML doc id
            content = record.payload.read()  # HTML content
        except Exception:
            # Report records that lack an expected header and keep going.
            traceback.print_exc()
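Once a record's content has been read, a typical next step is to pull some text out of the HTML. The following sketch extracts the page title using only the standard library. It assumes the payload is a bytes object that may start with HTTP response headers (as is common for crawled WARC response records) and that UTF-8 decoding with replacement is acceptable; neither is confirmed for these collections. The names TitleParser, strip_http_headers, and extract_title are hypothetical helpers, not part of the warc package.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def strip_http_headers(payload: bytes) -> bytes:
    """Drop a leading HTTP header block if the payload starts with one."""
    head, sep, body = payload.partition(b"\r\n\r\n")
    return body if sep and head.startswith(b"HTTP/") else payload


def extract_title(payload: bytes, encoding: str = "utf-8") -> str:
    """Return the stripped <title> text, or an empty string if none is found."""
    html = strip_http_headers(payload).decode(encoding, errors="replace")
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()
```

Inside the reading loop above, this could be called as extract_title(content) for each record whose content was read successfully.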

Authors

E-mail: [email protected]

Zhumin Chu (Tsinghua University, P.R.C.)

Tetsuya Sakai (Waseda University, Japan)

Qingyao Ai (Tsinghua University, P.R.C.)

Yiqun Liu (Tsinghua University, P.R.C.)
