SIT772 Database and Information Retrieval Assignment 2

Assignment 2

Introduction

This assessment helps students develop an understanding of information retrieval techniques.
This is an individual assessment task.
The project documentation submitted should include the answers, working, and associated tables and graphs for each task.
This assignment has a total of 20 marks and is worth 20% of your final result.

Unit Learning Outcomes

This unit, SIT772, has three Unit Learning Outcomes (ULOs), listed below. This assessment task focuses on the last two.
o ULO 3 - At the end of this unit students will be able to design and develop relational databases by using SQL and a database management system.
o ULO 5 - Develop problem solving skills in the context of data processing systems.
o ULO 6 - Work independently on self-directed learning tasks.

The assessment of this task will indicate whether students can partially attain these unit learning outcomes.

Instructions

Read these instructions and the following 2 questions.
Answer as many questions as possible.
Place your name, ID and answers in your document.
Please answer all questions in a single document and submit this to the assignment folder.

Task 1: Zipf’s Law (5+5=10 Marks)

Provide a brief description of Zipf’s Law and explain how it relates to information retrieval (searching for terms/words in a corpus).
Assume Zipf’s Law holds and that the most frequent term accounts for 20% of word occurrences. What is the fewest number of most common words that together account for at least 60% of word occurrences (i.e. the minimum value of m such that at least 60% of word occurrences are one of the m most common words)? You can use a table to help present your result.
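The calculation in the second part can be sketched in Python. This is a minimal sketch, assuming the simplest form of Zipf's Law, in which the term of rank r accounts for a share c / r of all word occurrences and c is the share of the most frequent term (0.2 here). The function name `min_terms_for_coverage` and its parameters are illustrative and are not part of the assignment.

```python
def min_terms_for_coverage(top_share, target):
    """Return (m, rows) for the smallest m whose cumulative share reaches target.

    Assumes the rank-r term has share top_share / r (simple Zipf's Law).
    Each row is (rank, share of that rank, cumulative share up to that rank).
    """
    rows = []
    cumulative = 0.0
    rank = 0
    while cumulative < target:
        rank += 1
        share = top_share / rank
        cumulative += share
        rows.append((rank, share, cumulative))
    return rank, rows


# Most frequent term covers 20% of occurrences; we want at least 60% coverage.
m, table = min_terms_for_coverage(top_share=0.2, target=0.6)

for rank, share, cumulative in table:
    print(f"{rank:>4}  {share:.4f}  {cumulative:.4f}")
print("minimum m:", m)
```

The rows printed by the loop have the same shape as the table suggested in the task: one row per rank, with the share of that rank and the running total.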

Task 2: Information Retrieval (IR) Evaluation (3+3+4=10 Marks)

The following data displays retrieval results for two different algorithms (Algorithm 1 and Algorithm 2) in response to two distinct queries (Query 1 and Query 2). An expert has manually labelled each of the documents as being either relevant or not relevant to the queries.

Algorithm 1 returns the following results:

Query 1:	d33	d6	d9	d48	d56	d76	d10	d29	d30	d5
	d11	d66	d3
Query 2:	d10	d76	d5	d67	d13	d45	d91	d16	d17	d22
	d20	d71	d48	d60	d25	d27

Algorithm 2 returns the following results:

Query 1:	d44	d41	d7	d77	d13	d14	d90	d80	d70	d4
	d8	d29	d6	d5	d15	d17	d20	d65	d2	d33

Query 2:	d9	d91	d99	d30	d17	d13	d26	d93	d42	d79
	d12	d10	d41	d11	d85	d89	d1	d49	d52	d76
	d20	d43	d88	d7	d98	d51	d50	d6	d3	d87
	d2	d28	d15	d14

An expert has identified the following documents as being relevant to Query 1 and Query 2, respectively.

Relevant to Query 1:	d8	d13	d29	d33	d41
Relevant to Query 2:	d2	d3	d7	d8	d9	d11	d12	d13	d15	d16
	d20

Objectives:

(a) For Algorithm 1, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Algorithm 1 (all three curves should be on a single chart).
(b) For Algorithm 2, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Algorithm 2 (all three curves should be on a single chart, but a separate chart from that used in part (a)).
(c) Plot the averages for Algorithm 1 and Algorithm 2 on a separate chart, and compare the algorithms in terms of precision and recall. Do you think one of the algorithms is superior? Provide a brief explanation of why this is the case.