719-p2-startup

The starter package for 15719 (Spring 2021) Project 2, Part 1

  • data/get_WARC_dataset.sh is a simple script that downloads Common Crawl data and stores it in HDFS. It takes the number of WET files to download and launches a Spark job to download them in parallel. In the names of the downloaded files, forward slashes ("/") are replaced with underscores ("_"). For example, when crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.wet.gz is downloaded, it is stored as /common_crawl_wet/crawl-data_CC-MAIN-2016-50_segments_1480698540409.8_wet_CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.wet.gz in HDFS.

  • data/reference_output_for_test_case_A is the reference output for the desired statistics computed for test case A in Part 1.

  • infra/config.yaml - config consumed by flintrock to set up your Spark Cluster.

  • infra/cluster-tool.sh - use this flintrock wrapper to set up and tear down the cluster.

  • code/run.sh and code/spark_etl.py - boilerplate to get you started.

  • submit runs the tests for grading and submits your solution. Run it as ./submit <code-path> <test-id> <data-path> <data-file-names> <stop-words-file>. The arguments are:

    • <code-path> is the local directory that contains your driver program and the run.sh script. It should contain nothing else.
    • <test-id> is the single letter (A, B, C, D, or E) that identifies the test case. Please make sure the number of slave instances matches the test specification, or your grading will fail.
    • <data-path> is the path in HDFS under which the WET files for testing are stored.
    • <data-file-names> is the file that contains the names of the WET files to be processed.
    • <stop-words-file> is the path to the stop-words file.
  • data/wet_hashes.txt and data/wet_sizes.txt - sha1sums, and sizes in bytes, for all WET files, for your reference (in case you're worried about data corruption/interrupted downloads).
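
The file-naming rule and the integrity check mentioned above can be sketched in Python as follows. This is only an illustration, not part of the starter package: the helper names hdfs_name and sha1_and_size are hypothetical, the /common_crawl_wet directory is taken from the example above, and the check assumes the file has already been copied from HDFS to the local filesystem (for example with hdfs dfs -get). The layout of data/wet_hashes.txt and data/wet_sizes.txt is not assumed here, so the comparison against those files is left to the reader.

```python
import hashlib
import os
from typing import Tuple

# Target HDFS directory, taken from the example in this README.
HDFS_DIR = "/common_crawl_wet"


def hdfs_name(crawl_path: str, hdfs_dir: str = HDFS_DIR) -> str:
    """Map a Common Crawl path to its HDFS path by replacing "/" with "_"."""
    return f"{hdfs_dir}/{crawl_path.replace('/', '_')}"


def sha1_and_size(local_path: str, chunk_size: int = 1 << 20) -> Tuple[str, int]:
    """Return (SHA-1 hex digest, size in bytes) of a local file.

    The file is read in chunks so that large WET files do not need to
    fit in memory.
    """
    digest = hashlib.sha1()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(local_path)
```

Comparing the returned digest and size with the matching entries in data/wet_hashes.txt and data/wet_sizes.txt should reveal a corrupted or truncated download.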

Pulling starter updates

  1. If the starter code is updated, we will post patch files on Piazza. Make sure to check Piazza frequently. Once you have the .patch files, you can apply them to your code:
     $ git apply <file>.patch
     $ git diff # review changes
  2. If there are conflicts, you will see messages containing "error: patch failed". Use cat <file>.patch to inspect the change and try to apply it manually. Please post on Piazza if you encounter any difficulties.
