SIT772 database and information Assignment 2
Introduction
- This assessment helps students develop an understanding of information retrieval techniques.
- This is an individual assessment task.
- The project documentation submitted should include the answers, working, and associated tables and graphs for each task.
- This assignment has a total of 20 marks and is worth 20% of your final result.
Unit Learning Outcomes
- The three Unit Learning Outcomes (ULOs) of this unit, SIT772, are listed below. This assessment task focuses on the last two.
  o ULO 3 - At the end of this unit students will be able to design and develop relational databases by using SQL and a database management system.
  o ULO 5 - Develop problem solving skills in the context of data processing systems.
  o ULO 6 - Work independently on self-directed learning tasks.
- The assessment of this task will indicate whether students can partially attain these unit learning outcomes.
Instructions
- Read these instructions and the following 2 questions.
- Answer as many questions as possible.
- Place your name, ID and answers in your document.
- Please answer all questions in a single document and submit this to the assignment folder.
Task 1: Zipf’s Law (5+5=10 Marks)
- Provide a brief description of Zipf’s Law and explain how it relates to information retrieval (searching for terms/words in a corpus).
- Assume Zipf’s law holds and that the most frequent term accounts for 20% of word occurrences. What is the fewest number of most common words that together account for at least 60% of word occurrences (i.e. the minimum value of m such that at least 60% of word occurrences are one of the m most common words)? You can use a table to help present your result.
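The cumulative calculation described in this task can be sketched in Python. This is a minimal illustration, not a model answer: it assumes the usual form of Zipf's law, in which the word at rank r accounts for a share equal to the top word's share divided by r. The function name `min_words_for_coverage` and its parameters are hypothetical, and the example values (10% and 25%) are deliberately different from those in the task.

```python
def min_words_for_coverage(top_share: float, target: float,
                           max_rank: int = 1_000_000) -> int:
    """Return the smallest m such that the m most common words together
    cover at least `target` of all word occurrences.

    Assumption: share(rank r) = top_share / r (classic Zipf's law).
    """
    cumulative = 0.0
    for rank in range(1, max_rank + 1):
        # Add the share contributed by the word at this rank.
        cumulative += top_share / rank
        if cumulative >= target:
            return rank
    raise ValueError("target coverage not reached within max_rank")


# Illustrative values only: top word = 10%, target coverage = 25%.
print(min_words_for_coverage(0.10, 0.25))  # -> 7
```

Printing `rank`, `top_share / rank` and `cumulative` inside the loop would give the kind of table the task suggests for presenting the working.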
Task 2: Information Retrieval (IR) Evaluation (3+3+4=10 Marks)
The following data displays retrieval results for two different algorithms (Algorithm 1 and Algorithm 2) in response to two distinct queries (Query 1 and Query 2). An expert has manually labelled each of the documents as being either relevant or not relevant to the queries.
Algorithm 1 returns the following results:
Query 1: d33, d6, d9, d48, d56, d76, d10, d29, d30, d5, d11, d66, d3
Query 2: d10, d76, d5, d67, d13, d45, d91, d16, d17, d22, d20, d71, d48, d60, d25, d27
Algorithm 2 returns the following results:
Query 1: d44, d41, d7, d77, d13, d14, d90, d80, d70, d4, d8, d29, d6, d5, d15, d17, d20, d65, d2, d33
Query 2: d9, d91, d99, d30, d17, d13, d26, d93, d42, d79, d12, d10, d41, d11, d85, d89, d1, d49, d52, d76, d20, d43, d88, d7, d98, d51, d50, d6, d3, d87, d2, d28, d15, d14
An expert has identified the following documents as being relevant to Query 1 and Query 2, respectively.
Relevant to Query 1: d8, d13, d29, d33, d41
Relevant to Query 2: d2, d3, d7, d8, d9, d11, d12, d13, d15, d16, d20
Objectives:
- For Algorithm 1, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Algorithm 1 (all three curves should be on a single chart).
- For Algorithm 2, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Algorithm 2 (all three curves should be on a single chart, but a separate chart from the one used for Algorithm 1).
- Plot the averages for Algorithm 1 and Algorithm 2 on a separate chart, and compare the algorithms in terms of precision and recall. Do you think one of the algorithms is superior? Provide a brief explanation of your answer.
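The interpolation to the 11 standard recall levels mentioned in these objectives can be sketched as follows. This is a minimal illustration under stated assumptions: results are taken to be ranked in the order listed, each document appears at most once per result list, and interpolated precision at a recall level is the highest precision observed at any recall greater than or equal to that level (0 if that recall is never reached). The function names and the toy document IDs are hypothetical and are not the assignment data. Plotting is left out.

```python
def eleven_point_precision(ranked, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    n_rel = len(relevant)
    hits = 0
    observed = []  # (relevant docs found so far, precision) at each hit
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            observed.append((hits, hits / rank))
    curve = []
    for i in range(11):
        # Recall hits/n_rel >= i/10, compared in integers to avoid
        # floating-point equality issues.
        candidates = [p for h, p in observed if h * 10 >= i * n_rel]
        curve.append(max(candidates, default=0.0))
    return curve


def average_curves(curves):
    """Average several 11-point curves level by level."""
    return [sum(values) / len(values) for values in zip(*curves)]


# Toy example (hypothetical IDs): "x" is relevant but never retrieved.
toy_curve = eleven_point_precision(["a", "b", "c", "d", "e"], {"a", "c", "x"})
print([round(p, 3) for p in toy_curve])
```

In the toy example, recall never exceeds 2/3, so the curve falls to 0 from the 0.7 level onward. Passing the per-query curves for one algorithm to `average_curves` gives the averaged curve for that algorithm.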