CS 585 Spring 2015 Homework 5 “search”

Due Thursday 26 March
Submit by midnight
Electronic submission to the CS portal for CS585
http://www.cs.uky.edu/csportal
Overview

In this assignment you will implement a simple information retrieval system that returns ranked web page results for a user-supplied query. Your results will take into account ranking based on your estimate of PageRank from the last assignment. You will turn in all your files in a single zip file via the CS Portal: https://www.cs.uky.edu/csportal. Instructions for turning in assignments via the portal are on the course web page at http://dmn.netlab.uky.edu/~seales/cs585.html

Details

You have now implemented a basic crawler and have estimated PageRank for a real link graph. In this assignment we turn our focus to the search problem, which will incorporate basic techniques from information retrieval. In this project you will use your estimate of PageRank from the previous assignment to provide ranked results based on a natural language query supplied by a user. The goal of your implementation is to accept a user-supplied query and return a ranked list of results that “match” the query or are in some way relevant to its terms.

Here are the specific tasks you must complete for this assignment:

• Use your crawler and ranking system from HW2/3 (or an open source tool if you prefer) to create an inverted index for the set of text-based web pages in your link graph. As before, create as complete a link graph/index as you can for the subdomain engr.uky.edu.
• Rank the pages in the link graph via the PageRank algorithm.
• For any user-supplied query of search terms, respond with a list of the top k web pages in the link graph. Rank the top pages via your ranking algorithm.
• At a minimum, your ranking algorithm should (1) select pages that contain some/all the search terms using the inverted index; and (2) rank the selection set of pages via PageRank.

As with earlier assignments, you may work together with your class colleagues to discuss tools, implementation issues, etc.
Once you have your own working implementation, you must run your own data collection and ranking computation. Your written code and your report must be your own.

Additional Information

Use any method you choose to store your data and build your inverted index from it. You may decide how to tokenize the text: at a minimum you should eliminate stop-words (you can decide on a list) and, if you choose, focus on parsing the text in specific target fields (e.g., paragraph tags). Report on your choices in your write-up, and be clear and specific about exactly how you are ranking based on (1) the index and retrieval method; and (2) PageRank. Provide example queries, their results, and a critique of how well your “search engine” works. Pick specific examples that illustrate how well (or how poorly) your information retrieval and ranking strategy works.

Turning in your Work

Create a pdf file of your report, and submit it together with any other files (your source code, spreadsheets, graphs, etc.) that you want me to see in a single zip file. Please use a directory structure to organize everything. Upload your single zip file via the CS portal for this class (link on the class website) before midnight on the due date.

Frequently Asked Questions

Q: I still don’t understand PageRank. How can I use it for this new assignment?
A: This is your chance to finish your implementation of PageRank and apply it.

Q: My index is huge. What do I do?
A: You can focus your crawler on specific pages (html) and specific tags within those pages (paragraph) to try to restrict the size. As an extreme case you could use just the title of each page as its set of words to index. I wouldn’t be that restrictive, but you have a spectrum you can work on while you develop your set of retrieved and indexed pages.

Q: I found a crawler I like called “Stingo”. It already does everything, including PageRank. Can I use that?
A: Yes. In this assignment the focus is on information retrieval and search.
Feel free to use Stingo to build your link graph and repository of text. You should, however, create your own inverted index and write your own ranking algorithm. Overall it would be better if you could build on your prior HW solutions.

Q: I am an engineer, not a report-writer. May I just turn in all my data and files and let you figure out what happened in my rankings?
A: No. And your reports should be getting better as you write more of them.

Q: Will future assignments in this class be building on my crawler + ranking code?
A: No. This is the last individual assignment. The next step is a group project.

Q: Will I find what I’m searching for?
A: How good is your search engine?
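
Illustrative sketch

To make the minimum ranking requirement concrete (select pages with the inverted index, then order the selection by PageRank), here is a minimal Python sketch. It is one possible shape, not a required design. The stop-word list, the function names (tokenize, build_inverted_index, search), the require_all switch, the example.test URLs, and the PageRank values are all assumptions made up for illustration; your own crawler output and PageRank estimates would take their place.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the assignment leaves the choice to you.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stop-words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def build_inverted_index(pages):
    """Map each term to the set of URLs whose text contains that term.

    `pages` is assumed to be a dict of {url: extracted page text}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(query, index, pagerank, k=10, require_all=True):
    """Return up to k URLs matching the query, ordered by PageRank.

    require_all=True keeps pages containing every query term;
    require_all=False keeps pages containing at least one term.
    """
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(t, set()) for t in terms]
    if require_all:
        selected = set.intersection(*postings)
    else:
        selected = set.union(*postings)
    # Highest PageRank first; ties are broken by URL for a stable order.
    ranked = sorted(selected, key=lambda u: (-pagerank.get(u, 0.0), u))
    return ranked[:k]


# Toy data standing in for crawled pages and PageRank estimates.
pages = {
    "http://example.test/a": "Research in the College of Engineering",
    "http://example.test/b": "Engineering admissions and research funding",
    "http://example.test/c": "Campus parking map",
}
pagerank = {
    "http://example.test/a": 0.2,
    "http://example.test/b": 0.5,
    "http://example.test/c": 0.3,
}

index = build_inverted_index(pages)
results = search("engineering research", index, pagerank, k=2)
print(results)
```

With this toy data, both /a and /b contain the two query terms, and /b is listed first because its assumed PageRank is higher. A sketch like this ignores term frequency and field weighting entirely, which is the kind of limitation worth examining in your critique.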