Information retrieval and web search assignment 1

1 Description

Solve the problems defined below. You should submit an archive containing all your code together with a README.txt file describing how to compile and execute each of your programs. All of your programs should be console applications, WITHOUT any user interface. Make sure that your code compiles on any machine (if you use a language that needs to be compiled). In total, you can get 100 points.

2 Tasks

2.1 Task 1 (40 points)

Write a script or a program that reads a text file, pre-processes it and saves the results into a new file. The file contains documents, one document per line. Each document is one or a few sentences.

Your program should take two parameters: input file name and output file name. It should pre-process the documents so that they can later be used to create an inverted index. Basic pre-processing should consider:

  • punctuation
  • tokenization
  • lower-casing/upper-casing
  • stop words removal
  • stemming

Your program should write the pre-processed documents into the output file.

You are allowed to use external libraries/packages to perform stop words removal and stemming (there are plenty of them for Java and Python, you need to check for other languages).
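
The steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference solution: the names STOP_WORDS, naive_stem and preprocess are hypothetical, the stop-word list is deliberately tiny, and the suffix stripper is a toy stand-in for a real stemmer (for example a Porter stemmer from an external library, which the task permits).

```python
import re
import sys

# Hypothetical, deliberately tiny stop-word list; a real solution would
# likely load a fuller list from a library.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(line):
    """Lower-case, drop punctuation, tokenize, remove stop words, stem."""
    # Keeping only runs of letters/digits handles punctuation and tokenization.
    tokens = re.findall(r"[a-z0-9]+", line.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def main(in_path, out_path):
    """Write one pre-processed document per output line."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(preprocess(line)) + "\n")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print("usage: preprocess.py INPUT OUTPUT", file=sys.stderr)
```

Keeping one output line per input line preserves the document order, so line numbers can still serve as document identifiers later.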

2.2 Task 2 (20 points)

Write a script or a program that reads a text file of documents (you can assume these are pre-processed already), creates an inverted index and saves it to a file.

Your program should take two parameters: input file name and output file name. The input file is made of documents, one per line; the line number of a document is its identifier.

Each line of the output file should contain the term and identifiers of documents that the term occurs in.
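
A minimal sketch of one way to build such an index is given below. It assumes, beyond what the task states, that identifiers are 1-based line numbers, that tokens are separated by whitespace, and that each output line has the form "term id id ..."; the function names build_index and write_index are hypothetical.

```python
import sys
from collections import defaultdict


def build_index(lines):
    """Map each term to the sorted list of line numbers (from 1) containing it."""
    postings = defaultdict(set)
    for doc_id, line in enumerate(lines, start=1):
        for term in line.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def write_index(index, out_path):
    """Write 'term id id ...' lines, terms in alphabetical order (assumed format)."""
    with open(out_path, "w", encoding="utf-8") as dst:
        for term in sorted(index):
            ids = " ".join(str(i) for i in index[term])
            dst.write(f"{term} {ids}\n")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        with open(sys.argv[1], encoding="utf-8") as src:
            write_index(build_index(src), sys.argv[2])
    else:
        print("usage: index.py INPUT OUTPUT", file=sys.stderr)
```

Using a set per term means a document is listed once even if the term occurs in it several times.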

2.3 Task 3 (20 points)

Write a script or a program that reads a text file containing an inverted index, creates the TF-IDF weights matrix and saves it in a file.

Your program should take two parameters: input file name and output file name. The input file can be assumed to be a representation of an inverted index, where each line contains a term and the identifiers of the documents in which that term occurs.

Your output file should contain the weights matrix, with document identifiers as the column headers and terms in the first column.
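
The task does not fix a particular TF-IDF variant, so the sketch below makes several assumptions: the weight is tf * log10(N / df), N is the number of distinct document identifiers seen in the index, tf is the number of times an identifier is listed for a term (so 1 if the index lists each document once), identifiers are integers, and the matrix is written tab-separated. All function names are hypothetical.

```python
import math
import sys
from collections import Counter


def parse_index(lines):
    """Parse 'term id id ...' lines into {term: Counter({doc_id: tf})}."""
    index = {}
    for line in lines:
        parts = line.split()
        if len(parts) > 1:  # skip blank lines and terms without postings
            index[parts[0]] = Counter(int(p) for p in parts[1:])
    return index


def tfidf_matrix(index):
    """Return (doc_ids, {term: [weight per doc]}), weight = tf * log10(N / df)."""
    doc_ids = sorted({d for postings in index.values() for d in postings})
    n_docs = len(doc_ids)
    matrix = {}
    for term, postings in index.items():
        idf = math.log10(n_docs / len(postings))  # df = number of distinct docs
        matrix[term] = [postings.get(d, 0) * idf for d in doc_ids]
    return doc_ids, matrix


def write_matrix(doc_ids, matrix, out_path):
    """Header row of document identifiers, then one tab-separated row per term."""
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\t" + "\t".join(str(d) for d in doc_ids) + "\n")
        for term in sorted(matrix):
            dst.write(term + "\t" + "\t".join(f"{w:.6f}" for w in matrix[term]) + "\n")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        with open(sys.argv[1], encoding="utf-8") as src:
            ids, weights = tfidf_matrix(parse_index(src))
        write_matrix(ids, weights, sys.argv[2])
    else:
        print("usage: tfidf.py INPUT OUTPUT", file=sys.stderr)
```

Under these assumptions a term that occurs in every document gets weight 0 everywhere, since log10(N / N) = 0.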

2.4 Task 4 (20 points)

Write a script or a program that takes three parameters: the name of a text file containing a weights matrix as defined in the previous task, and the identifiers of two documents. Your program should output the cosine similarity of those two documents.
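
As a rough illustration, cosine similarity is the dot product of the two document columns divided by the product of their Euclidean norms. The sketch below assumes a tab-separated matrix file whose first row holds the document identifiers and whose first column holds the terms; it also returns 0.0 when either vector is all zeros, which is a convention chosen here, not something the task specifies. The names read_columns and cosine are hypothetical.

```python
import math
import sys


def read_columns(lines):
    """Parse a tab-separated matrix into {doc_id: [weights down that column]}."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header = rows[0][1:]  # first cell of the header row is the term column
    columns = {doc_id: [] for doc_id in header}
    for row in rows[1:]:
        for doc_id, value in zip(header, row[1:]):
            columns[doc_id].append(float(value))
    return columns


def cosine(u, v):
    """Dot product over the product of norms; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    if len(sys.argv) == 4:
        with open(sys.argv[1], encoding="utf-8") as src:
            cols = read_columns(src)
        print(cosine(cols[sys.argv[2]], cols[sys.argv[3]]))
    else:
        print("usage: cosine.py MATRIX DOC_ID_1 DOC_ID_2", file=sys.stderr)
```

Document identifiers are kept as strings here so that they can be compared directly with the command-line arguments.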