Code Monkey home page Code Monkey logo

mapreduce-join-and-pig-latin-query's Introduction

MapReduce and Pig Latin Query

To get some hands on experience with the pig/latin language and capabilities, you will implement a few different queries -- aggregating content from two different collections. For you to have a better appreciation for the features provided by Pig Latin, you will also implement a join between the two collections we provide using both MapReduce and Pig Latin.

As input you will read from two CSV files that contain descriptions about users and tweets, respectively will be upload into your blackboard (each group will have different users and tweets file).

Each line in the user collection contains login, name and state from a specific user. Each line in the collection of tweets has the tweet id, content, and a reference to the user who wrote that tweet.

Here are the problems you will solve:

  1. Write a Pig Latin query that outputs the login of all users in NY state.
  2. Write a Pig Latin query that returns all the tweets that include the word 'favorite', ordered by tweet id.
  3. Write the equivalent join using Pig Latin.
  4. Write a Pig Latin query that returns the number of tweets for each user name (not login). You should output one user per line, in the following format: user_name, number_of_tweets
  5. Write a Pig Latin query that returns the number of tweets for each user name (not login), ordered from most active to least active users. You should output one user per line, in the following format: user_name, number_of_tweets
  6. Write a Pig Latin query that returns the name of users that posted at least two tweets. You should output one user name per line.
  7. Write a Pig Latin query that returns the name of users that posted no tweets. You should output one user name per line.

Requirements

  1. You can write your join in Java.

Java: In addition to the source code, you will submit a JAR file that can be called using the following command:

hadoop -jar join.jar Join -input1 <file_name> -input2 <file_name> -output <file_name>

Note: Copy the input files to your HDFS directory. Your program should read the input files from the HDFS directory.

  1. You must write the Pig Latin code as stand-alone scripts, one script per query. The name of the file should correspond to the name of the query. For example, using the first query, you should be able to run:
pix -x local 1.pig

For each query, the result should be in files whose names have as prefix the query name, and extension '.result'. For instance, for the first query the result should be in 1a.result

mapreduce-join-and-pig-latin-query's People

Contributors

likweitan avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar

Forkers

soyarbeanery

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.