Student Name:
Student NetID:
University of Texas at Dallas
Department of Computer Science
CS6322 – Information Retrieval Section: ____________
Fall 2014
Instructor: Dr. Sanda Harabagiu
Take-Home Mid-Term Exam
Issued: October 15th 2014
Due: October 22nd 2014, in class
________________________________________________________________
Problem 1:
Consider the following three short documents:
Doc #1
__________________________________________________________________
The researchers will focus on computational phenotyping and will
produce disease prediction models from machine learning and statistical tools.
___________________________________________________________________
Doc #2
___________________________________________________________________
The researchers will develop tools that use Bayesian statistical information
to generate causal models from large and complex phenotyping datasets.
___________________________________________________________________
Doc #3
___________________________________________________________________
The researchers will build a computational information engine that uses machine learning
to combine gene function and gene interaction information from disparate genomic data sources.
___________________________________________________________________
A. (18 points) First remove stop words and punctuation; manually parse the documents,
select the terms from the three documents, and create the dictionary (3 points). Create the document
vectors by computing three weights: (i) binary weights (3 points); (ii) raw weights (4 points); and
(iii) TF-IDF weights (8 points).
For each form of weighting list the document vectors in the following format:
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8  ...
____________________________________________________________
DOC1    0      3      1      0      0      2      1      0      ...
DOC2    5      0      0      0      3      0      0      2      ...
DOC3    3      0      4      3      4      0      0      5      ...
____________________________________________________________
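As an illustration of the three weighting schemes, here is a minimal Python sketch. It runs on a small hypothetical collection (not the three exam documents) that is assumed to be already tokenized with stop words removed. The TF-IDF variant used, tf * log10(N / df), is an assumption; other course-approved variants may differ in the logarithm base or normalization.

```python
import math
from collections import Counter

# Hypothetical toy collection (NOT the exam documents): already tokenized,
# with stop words and punctuation removed.
docs = {
    "D1": ["gene", "model", "gene"],
    "D2": ["model", "tool"],
    "D3": ["tool", "data", "tool"],
}

# The dictionary is the sorted set of distinct terms in the collection.
dictionary = sorted({t for tokens in docs.values() for t in tokens})


def binary_vector(tokens):
    """1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if t in present else 0 for t in dictionary]


def raw_vector(tokens):
    """Raw term frequency of each dictionary term in the document."""
    counts = Counter(tokens)
    return [counts[t] for t in dictionary]


def tfidf_vector(tokens):
    """Assumed weighting: tf * log10(N / df)."""
    n_docs = len(docs)
    counts = Counter(tokens)
    # Document frequency: number of documents containing the term.
    df = {t: sum(1 for d in docs.values() if t in d) for t in dictionary}
    return [counts[t] * math.log10(n_docs / df[t]) for t in dictionary]


print("terms", dictionary)
for name, tokens in docs.items():
    print(name, binary_vector(tokens), raw_vector(tokens),
          [round(w, 3) for w in tfidf_vector(tokens)])
```

Under this weighting, a term that occurs in every document gets weight 0, since log10(N / N) = 0.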
SOLUTION 1.A:
B. (12 points) Create an inverted (sorted) list index of the three documents, including the dictionary
and the postings. The dictionary should also contain (for each term) statistics such as total number of
occurrences in the collection. The postings for each term should contain the document ids and the term
frequencies. List the index for the first 5 terms from the dictionary only. You do not need to list the
entire index.
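The index structure described above can be sketched as follows. This is only one possible layout: the field names ("df", "total", "postings") and the toy collection are hypothetical and not taken from the exam documents.

```python
from collections import Counter, defaultdict

# Hypothetical toy collection (NOT the exam documents), keyed by document id.
docs = {
    1: ["gene", "model", "gene"],
    2: ["model", "tool"],
    3: ["tool", "data", "tool"],
}


def build_index(docs):
    """Build a sorted inverted index with per-term statistics.

    Each dictionary entry holds the document frequency, the total number
    of occurrences in the collection, and a postings list of
    (doc_id, term_frequency) pairs sorted by doc_id.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id]).items():
            postings[term].append((doc_id, tf))

    index = {}
    for term in sorted(postings):  # dictionary in sorted term order
        plist = postings[term]
        index[term] = {
            "df": len(plist),
            "total": sum(tf for _, tf in plist),
            "postings": plist,
        }
    return index


index = build_index(docs)
for term, entry in index.items():
    print(term, entry["df"], entry["total"], entry["postings"])
```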
SOLUTION 1.B:
C. (12 points) What are the hit lists for the following Boolean queries (in each case explain how you
obtained them from the inverted index):
1. researcher AND information (3 points)
2. (researcher AND gene) OR (researcher AND tool) (3 points)
3. (model AND disease AND dataset) OR source (3 points)
4. (phenotype OR engine) AND (disease OR focus) (3 points)
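Boolean queries of this shape are typically answered by merging postings lists: AND intersects two sorted lists of document ids and OR takes their union. The sketch below shows that idea on a hypothetical index with made-up postings; the helper names are illustrative only.

```python
# Hypothetical postings (sorted document ids per term), NOT the exam index.
index = {
    "data": [3],
    "gene": [1],
    "model": [1, 2],
    "tool": [2, 3],
}


def postings(term):
    """Postings list of a term; empty if the term is not in the dictionary."""
    return index.get(term, [])


def intersect(p1, p2):
    """AND: linear merge of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def union(p1, p2):
    """OR: sorted union of two postings lists."""
    return sorted(set(p1) | set(p2))


# (model AND gene) OR (model AND tool)
hits = union(intersect(postings("model"), postings("gene")),
             intersect(postings("model"), postings("tool")))
print(hits)
```

Explaining a hit list then amounts to listing the postings of each term and the intermediate result of each AND and OR step.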
SOLUTION 1.C:
D. (24 points) Compute the similarity coefficients for each of the four queries and each of the three
documents using: (i) the cosine similarity (4 points for each query); (ii) the Jaccard similarity (2 points
for each query).
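A minimal sketch of the two similarity coefficients on weight vectors follows. The Jaccard form used here is the vector (extended) version, dot / (|q|^2 + |d|^2 - dot), which is an assumption; on binary vectors it reduces to the set definition |A ∩ B| / |A ∪ B|. The query and document vectors are hypothetical.

```python
import math


def dot(q, d):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(q, d))


def cosine(q, d):
    """Cosine similarity: dot(q, d) / (|q| * |d|); 0.0 for a zero vector."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0


def jaccard(q, d):
    """Assumed vector form: dot / (|q|^2 + |d|^2 - dot)."""
    denom = dot(q, q) + dot(d, d) - dot(q, d)
    return dot(q, d) / denom if denom else 0.0


# Hypothetical binary query and document vectors over a 4-term dictionary.
query = [1, 1, 0, 0]
doc = [1, 0, 1, 0]
print(round(cosine(query, doc), 3), round(jaccard(query, doc), 3))
```

In this toy case the vectors share one of three distinct terms, so the Jaccard value is 1/3, while the cosine is 1 / (sqrt(2) * sqrt(2)) = 0.5.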
SOLUTION 1.D:
Problem 2:
Suppose you have a collection of 5 documents, and only 8 terms are used:
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
_____________________________________________
DOC1    0      7      1      0      5      2      1      0
DOC2    5      0      0      9      3      0      0      2
DOC3    3      0      6      3      4      0      1      5
DOC4    1      8      0      3      0      1      4      0
DOC5    2      7      0      0      0      5      4      2
______________________________________________
List the values of the gaps for the first four terms in your index computed for this collection. Encode these
gaps with (i) unary codes (8 points); (ii) Elias gamma codes (16 points); and (iii) Elias delta codes (10
points). You are allowed to write a program to help you compute the codes. If you choose to use one,
please add the program's code to the exam.
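A program of the kind permitted above might look like the following sketch. It assumes one common textbook convention, which may differ from the one used in class: the unary code of n is n ones followed by a zero, the gamma code is the unary code of the offset length followed by the offset (the binary form of n without its leading 1), and the delta code is the gamma code of the bit length of n followed by the same offset. The first gap is taken to be the first document id itself. The postings list is hypothetical.

```python
def gaps(doc_ids):
    """Turn a sorted postings list into gaps; the first entry is kept as is."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def unary(n):
    """Assumed convention: n ones followed by a terminating zero."""
    if n < 0:
        raise ValueError("unary code needs n >= 0")
    return "1" * n + "0"


def gamma(n):
    """Elias gamma: unary(len(offset)) + offset, for n >= 1."""
    if n < 1:
        raise ValueError("gamma code needs n >= 1")
    offset = bin(n)[3:]  # strip the '0b' prefix and the leading 1 bit
    return unary(len(offset)) + offset


def delta(n):
    """Elias delta: gamma(bit length of n) + offset, for n >= 1."""
    if n < 1:
        raise ValueError("delta code needs n >= 1")
    bits = bin(n)[2:]
    return gamma(len(bits)) + bits[1:]


# Hypothetical postings list (NOT derived from the exam table).
example_postings = [3, 7, 8, 15]
for g in gaps(example_postings):
    print(g, unary(g), gamma(g), delta(g))
```

Under the original Elias convention the length prefix is written with zeros terminated by a one instead; the code lengths are the same, only the bit patterns of the prefix change.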
SOLUTION 2.I:
SOLUTION 2.II:
SOLUTION 2.III: