ISIT312/ISIT912 Big Data Management Assignment 1

Task 1. PutCombine HDFS Application (3 marks)

This task uses the source code PutCombine.java and the data in FilesToBeMerged.zip, both of which are available on the Moodle site.

The PutCombine application extends Hadoop’s own functionality. The motivation for this application arises when we want to analyse fragmented files such as logs from web servers. We can copy each file into HDFS, but in general, Hadoop works more effectively with a single large file than with a number of smaller ones. Besides, for analytics purposes we think of all the data as one big file, even though it is spread over multiple files as an incidental result of the physical infrastructure that creates the data.

One solution is to merge all the files first and then upload the combined file into HDFS. Unfortunately, merging the files first requires a lot of disk space on the local machine. It would be much easier if we could merge all the files on the fly as we upload them into HDFS.

What we need is, therefore, a “put-and-combine” type of operation. Hadoop’s command line utilities include a “getmerge” command for merging a number of HDFS files before copying them onto the local machine. What we’re looking for is the exact opposite, which is not available in Hadoop’s file utilities.

The attached source code of PutCombine is a Java application that fulfils this purpose.
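
To illustrate the merge-on-the-fly idea only (this is not the PutCombine.java supplied on Moodle), the following sketch streams every file in a directory into a single output stream, so that no merged copy is written to the local disk first. It uses only java.nio and a generic OutputStream; in an HDFS setting the destination stream would instead be obtained from Hadoop's FileSystem.create(Path). The names MergeSketch and mergeInto are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class MergeSketch {

    // Copies the bytes of every regular file in inputDir (sorted by name) into out.
    // Returns the number of files that were merged.
    static int mergeInto(Path inputDir, OutputStream out) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(inputDir)) {
            files = listing.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        for (Path file : files) {
            Files.copy(file, out);   // streams the file content straight into the target
        }
        out.flush();
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        // Two small made-up fragments stand in for the files to be merged.
        Path dir = Files.createTempDirectory("fragments");
        Files.write(dir.resolve("a.log"), "first\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.log"), "second\n".getBytes(StandardCharsets.UTF_8));

        // An in-memory stream stands in for the HDFS output stream (assumption).
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        int count = mergeInto(dir, merged);
        System.out.println(count + " files merged:");
        System.out.print(merged.toString("UTF-8"));
    }
}
```

Because the target is just an OutputStream, the same loop works whether the bytes end up in memory, in a local file, or in a stream opened on HDFS.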

You are required to:

  1. Compare PutCombine with the FileSystemPut and FileSystemPutAlt applications in the lecture notes and describe the differences. You must link the differences to the source code. (1.5 marks)
  2. Compile PutCombine.java, create a jar file, and use it to upload and merge the files in FilesToBeMerged.zip (you need to unzip it first). Also use an HDFS shell command to show the output of this application, i.e., the merged file. (1.5 marks)

Deliverables:

A file solution1.pdf with your answers to the above two questions. For the second question, the report should include: (i) all the step-by-step commands you use to compile and run the application, (ii) a brief explanation of the purpose of each command, and (iii) the return of the HDFS shell command.

Task 2. MapReduce Model (3 marks)

A file customers.txt has the following contents.

00001 James
00002 Harry
00003 Peter
00004 Jane
... ...

The numbers in the first column represent customer numbers and the names in the second column represent customer names.

A file orders.txt has the following contents.

0000001 00001 34.5
0000002 00001 23.0
0000003 00002 123.0
0000004 00003 12.3
... ... ...

The numbers in the first column represent order numbers, the numbers in the second column represent customer numbers, and the numbers in the third column represent total order values.

The objective of this task is to join the rows in the file customers.txt with the rows in the file orders.txt over an equality condition on the values of customer number.

Assume that both files have been loaded to HDFS. Explain how you would implement the Map phase and the Reduce phase of a MapReduce application that joins the rows from both files over an equality condition on the values of customer number.

This task does not require you to write any code in Java. However, comprehensive explanations of how to join the rows are expected. You may (but are not required to) support your explanations with fragments of pseudocode.
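
As an indication of the level of detail that such fragments can take, one common way to approach this kind of task is a reduce-side join: the Map phase emits the customer number as the key and tags each value with its source file, and the Reduce phase pairs the customer name with every order that shares the key. The sketch below simulates both phases in plain Java on in-memory lines, with no Hadoop dependency. The class name, the "C"/"O" tags, the inner-join behaviour, and the output layout are assumptions made for illustration, not a required solution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class JoinSketch {

    // Map phase: emit {customerNumber, taggedValue} for one input line.
    // fromCustomers tells which file the line came from (assumed known to the mapper).
    static String[] map(String line, boolean fromCustomers) {
        String[] f = line.trim().split("\\s+");
        if (fromCustomers) {
            return new String[] { f[0], "C\t" + f[1] };            // customers.txt: number, name
        }
        return new String[] { f[1], "O\t" + f[0] + "\t" + f[2] };  // orders.txt: order, number, value
    }

    // Reduce phase: all tagged values for one customer number arrive together.
    static List<String> reduce(String customerNumber, List<String> values) {
        String name = null;
        List<String[]> orders = new ArrayList<>();
        for (String v : values) {
            String[] p = v.split("\t");
            if (p[0].equals("C")) name = p[1]; else orders.add(p);
        }
        List<String> joined = new ArrayList<>();
        if (name == null) return joined;   // inner join: orders without a customer are dropped
        for (String[] o : orders) {
            joined.add(customerNumber + " " + name + " " + o[1] + " " + o[2]);
        }
        return joined;
    }

    // Simulates the shuffle: groups mapped values by key, then reduces each group.
    static List<String> join(List<String> customers, List<String> orders) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String c : customers) {
            String[] kv = map(c, true);
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        for (String o : orders) {
            String[] kv = map(o, false);
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> out = new ArrayList<>();
        groups.forEach((k, v) -> out.addAll(reduce(k, v)));
        return out;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList("00001 James", "00002 Harry", "00003 Peter", "00004 Jane");
        List<String> orders = Arrays.asList("0000001 00001 34.5", "0000002 00001 23.0",
                                            "0000003 00002 123.0", "0000004 00003 12.3");
        join(customers, orders).forEach(System.out::println);
    }
}
```

In this simulation a customer without any orders (such as 00004 in the sample) produces no output row; whether that is the desired behaviour depends on the kind of join being described.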

Deliverables:

A file solution2.pdf with your comprehensive explanations of how to join the rows in the file customers.txt with the rows in the file orders.txt over an equality condition on the values of customer number.

Task 3. Patent Claim MapReduce Application (4 marks)

This task is to process the data in a file named apat63_99.txt, which is in the “dataset” folder on the Desktop of BigDataVM.

The file apat63_99.txt contains information about almost 3 million U.S. patents granted between January 1963 and December 1999. (See http://www.nber.org/patents/ for more information.) The following table describes (some) meta-information about this data set.

Attribute Name    Description
PATENT            Patent number
GYEAR             Grant year
GDATE             Grant date, given as the number of days elapsed since January 1, 1960
APPYEAR           Application year (available only for patents granted since 1967)
COUNTRY           Country of first inventor
POSSTATE          State of first inventor (if country is U.S.)
ASSIGNEE          Numeric identifier for assignee (i.e., patent owner)
ASSCODE           One-digit (1-9) assignee type. (The assignee type includes U.S. individual, U.S. government, U.S. organization, non-U.S. individual, etc.)
CLAIMS            Number of claims (available only for patents granted since 1975)
NCLASS            3-digit main patent class
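
As a small illustration of how the grant-date attribute is encoded, the sketch below converts a day count into a calendar date with java.time, taking January 1, 1960 as day 0 as described in the table. The class and method names are hypothetical, and the sample value is made up rather than taken from the data set.

```java
import java.time.LocalDate;

class GrantDateSketch {
    // Reference date from the attribute description (assumed to count as day 0).
    static final LocalDate EPOCH = LocalDate.of(1960, 1, 1);

    // Converts "days elapsed since January 1, 1960" into a calendar date.
    static LocalDate toDate(long daysSince1960) {
        return EPOCH.plusDays(daysSince1960);
    }

    public static void main(String[] args) {
        // 1960 is a leap year with 366 days, so day 366 is 1961-01-01.
        System.out.println(toDate(366));
    }
}
```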

The objective of this task is to implement a MapReduce application that computes the total number of claims in all patents per country per year since 1975. Thus, you will need to use attributes PATENT, GYEAR, COUNTRY and CLAIMS.
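
The aggregation itself can be pictured as a map step that emits a (country, year) key with the claim count, followed by a reduce step that sums the counts per key. The following plain-Java sketch simulates this without Hadoop. It assumes comma-separated records whose columns follow the order of the attribute table, with the country wrapped in double quotes; the real layout of apat63_99.txt should be checked before relying on these positions. The class name, the tab-separated key, and the sample records are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ClaimsSketch {
    // Assumed zero-based column positions, following the order of the attribute table.
    static final int GYEAR = 1, COUNTRY = 4, CLAIMS = 8;

    // Map step: returns {"COUNTRY<tab>YEAR", claims}, or null if the record is skipped.
    static String[] map(String line) {
        String[] f = line.split(",", -1);          // -1 keeps trailing empty fields
        if (f.length <= CLAIMS) return null;
        String year = f[GYEAR].trim();
        String claims = f[CLAIMS].trim();
        // Skips a header line (non-numeric year) and records with no claim count.
        if (!year.matches("\\d{4}") || !claims.matches("\\d+")) return null;
        if (Integer.parseInt(year) < 1975) return null;
        String country = f[COUNTRY].replace("\"", "").trim();
        return new String[] { country + "\t" + year, claims };
    }

    // Reduce step simulated with a sorted map: sums the claims per (country, year) key.
    static Map<String, Long> totals(List<String> lines) {
        Map<String, Long> sums = new TreeMap<>();
        for (String line : lines) {
            String[] kv = map(line);
            if (kv != null) sums.merge(kv[0], Long.parseLong(kv[1]), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Made-up records in the assumed layout, not rows from apat63_99.txt.
        List<String> sample = Arrays.asList(
            "1,1980,7500,1978,\"US\",\"CA\",,2,10,100",
            "2,1980,7600,1979,\"US\",\"NY\",,2,5,200",
            "3,1996,13200,1994,\"JP\",\"\",,3,7,300",
            "4,1970,3700,1969,\"US\",\"TX\",,2,,400");
        totals(sample).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```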

Requirements:

  • Name the Java class as “TotalClaimsByCountryByYear”.
  • The application must use ToolRunner and a Partitioner. The Partitioner partitions the output into three groups according to the years: (1) 1975-1984, (2) 1985-1994 and (3) after 1994.
  • Load apat63_99.txt to HDFS and run the new application to process it.
  • Use an HDFS shell command (in Zeppelin) to show the file(s) containing the output after 1994.
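
The year-based grouping in the requirements boils down to mapping a grant year to one of three partition numbers. The sketch below shows that mapping as a plain static method; in a Hadoop job, comparable logic would sit inside the getPartition(key, value, numPartitions) method of a Partitioner subclass, with the year extracted from the key or the value. The class name and the choice of partition numbers 0, 1 and 2 are assumptions.

```java
class YearPartitionSketch {
    // Returns 0 for years up to 1984 (the 1975-1984 group), 1 for 1985-1994,
    // and 2 for later years. Earlier years are not expected to reach this point,
    // because the task only covers patents granted since 1975.
    static int partitionFor(int year, int numPartitions) {
        int group;
        if (year <= 1984) {
            group = 0;
        } else if (year <= 1994) {
            group = 1;
        } else {
            group = 2;
        }
        // The modulo keeps the result valid if fewer than three reducers are configured.
        return group % numPartitions;
    }

    public static void main(String[] args) {
        for (int year : new int[] {1975, 1984, 1985, 1994, 1995, 1999}) {
            System.out.println(year + " -> partition " + partitionFor(year, 3));
        }
    }
}
```

With this numbering the grouping only takes effect when the job runs with three reduce tasks; with fewer, several year groups share one output file.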

Deliverables:

  1. The Java source code.
  2. A PDF file that includes (i) all the step-by-step commands you use to compile and run the application, (ii) a brief explanation of the purpose of each command, and (iii) the return of the HDFS shell command.