IT-Universitetet i København  
 
 
 


Project Cluster: Programming Workshop 2009

October 2009

Introductory Programming (GPN)
Object-oriented Programming (OOP)


Table of contents

This page:

Introductory information

Welcome to the homepage of the Search Engine Project. The Search Engine Project is a 7.5 ECTS project on implementing the core functionality of a simple, text-based search-engine (think Google).

In order to account for the difference in programming experience between the different students, the project is divided in three parts:

  • The joint part should be completed by all the students, regardless of whether they come from GPN or OOP. It consists of creating the core of the search engine, and a fully functional program will be obtained at the end.
  • The advanced part is only mandatory for OOP students. It will add some features and optimizations to the search engine.
  • Finally, the extensions are a list of optional features for the engines. Students who have completed their mandatory tasks can pick any that they find interesting and start working on them.

The project will be carried out in groups. Each group must consist of three or four persons.

Each group is required to do a project agreement and to describe its work in a written report. The course ends with an oral exam.

There will be a weekly meeting, on Monday between 13:00 and 15:00. We will first discuss general issues and then offer more individual help. We meet in Room 4A20, except for October 26 and November 9 where we meet in Room 4A14.

At the first project day, October 19th, the project will be presented, and the programming techniques required for solving the joint part will be presented. Therefore, in the description of the assignment below, the presence of unfamiliar terms should not discourage you. Parts of the text will be especially incomprehensible if you are unfamiliar with linked lists; linked lists are one of the programming techniques discussed at the first project day.


The assignment

Data files

Three data files containing (parts of) old ITU web pages are available for use as field data for the project. The following example illustrates the file format:

*PAGE:http://www.itu.dk/index.html
Here
is
a
word
and
one
word
more
*PAGE:http://www.itu.dk/anotherpage.html
Here
is
more
and
yet
more

The idea is that a data file contains information about a series of web pages. The information for a web page starts with a *PAGE line which denotes at which address the page can be found. Such an address is called a URL. The list of words contained in the page follows the URL. The words are listed one word per line in the same order as they occur on the page. As seen in the example, each word can have several occurrences on the same page.

The following three data files are available:

The program SearchCmd

As part of the background material, the program SearchCmd.java is available. It implements a small search engine. The program takes the name of the data file used for searching as a parameter.

In the program, the given parameter is accessible in args[0] where args is the argument to the main method. The program then constructs a linked list (see data structures) with all the lines from the data file given in args[0]. The objects in the linked list contain the fields str and next. The field str contains a line from the data file and next points to the next object in the list. After reading the data file, a simple user interface launches, enabling the user to query the search engine: Given a word, the engine informs the user whether or not the word was present in the input.

The joint part

The joint part consists of solving the following 4 assignments. These assignments will be presented in detail during the first meeting, on October 19th.
  1. Compile and run the program SearchCmd.
  2. Modify SearchCmd so that if the user requests a search for a particular word (say, "Hans"), the URLs of all pages where the word "Hans" occurs are written to the screen. This has to be done by changing the search method in SearchCmd. It is not done by changing the data structure.
  3. Modify the construction of the data structure in such a way that a linked list of objects containing three fields is created. The three fields are str, next and urls. The fields str and next are like in SearchCmd, with the exception that there are only words in the list and each word only occurs once. In case an object contains the word "Hans" in the field str, the object's urls field will be a pointer to a list of URLs of all pages that contain the word "Hans". A list of URLs consists of objects with two fields; next and url. The field next is a pointer to the next object in the list and the field url is (a pointer to) a URL. After construction of this data structure, you have to program the corresponding search procedure(s). The data structure created in this assignment has smaller size than the structure from SearchCmd and is faster to search through.
  4. Modify the solution from assignment 3 so as to use a hashtable (see data structures) instead of a linked list of words. You can create the data structure using chained hashing. This means that each entry contains a reference to a linked list of URLs. The use of a hashtable will have a dramatic effect on both the time it takes to initialize the search engine and the time it takes to perform individual searches.

In the solution of the above assignments it is not allowed to use any packages other than java.io and java.lang. As a consequence it is not allowed to use any of the classes from the java.util package: One of the purposes of this project is to learn how to program various data structures yourself, as opposed to just using data structures programmed by others.

As you can see, you are allowed to use different methods on String objects. It is also allowed to use the method hashCode and to allocate arrays. If you have any doubt whether some Java functionality is allowed for use or not, contact one of the teachers.

The advanced part

For the assignments after the joint part, there are no restrictions on the Java functionality you may use. We however recommend that you do not spend too much time looking for advanced and unnecessarily complicated features in the vast collection of java libraries.

The advanced part is mandatory for OOP students. The task is to add support for the boolean operators AND and OR to the search engine. The resulting search engine should accept three kinds of input strings:

  1. Ordinary words, as before. Example: "Hans".
  2. A word AND another word. Example: "Hans AND Grethe".
  3. A word OR another word. Example: "Hans OR Grethe".

For example, the input "Hans AND Grethe" should result in a list of those pages where both words "Hans" and "Grethe" occur, while the input "Hans OR Grethe" should result in a list of those pages where either "Hans" or "Grethe" (or both) occurs. You are not required to support more complex expressions containing several occurrences of AND and OR. (If you are interested in solving that problem, see one of the suggested extensions below.)

As a starting point you should use your solution to assignment 4 of the joint part (i.e., the hashtable-based solution).

The advanced part is deliberately more open-ended than the joint part. However, we recommend that you proceed as follows:

  1. Modify the search method in SearchCmd such that it recognizes (but does not yet respond to) the three kinds of input strings listed above; you may want to use the java.util.StringTokenizer class. If the input is an ordinary word, a normal search is performed. If the input is, e.g., "Hans AND Grethe", the method simply prints "And: Hans Grethe" instead of performing a search. If the input is, e.g., "Hans OR Grethe", the method simply prints "Or: Hans Grethe" instead of performing a search.

    (This advice applies to everybody, not just new programmers. Failure to divide tasks into sub-tasks can result in spectacularly complicated debugging sessions where the error is finally traced to a supposedly "easy" part.)

  2. Modify the search method such that it responds correctly to "AND" input strings. For now, aim for the simplest and easiest solution you can think of, without worrying about performance. Do the same for "OR" input strings.
  3. (Optional.) Try to improve the design. The goal is to make searches faster and/or decrease the memory usage. You may want to look at the java.util.HashSet class or other classes from the Java Collections Framework.

Extensions

The following are suggestions for assignments to do after the mandatory part(s). The ordering of these suggestions is random.

  • Write a web crawler.
  • Use the extended data file.
  • Investigate how to easily support prefix-search. For example a search for "arm*" should find all pages with "arms", "armoury", "armadillo", "armageddon", "armistice" etc. This only requires an extra table of pointers to the words, where the table is ordered by words.
  • Create a graphical user interface.
  • Make the program more efficient in terms of usage of space (mostly for people studying Introduction to algorithms and data structures). Idea: Instead of having pointers to URL-strings, you can put the various URL-strings in a table. The pointers can now be replaced by a number corresponding to the entry in the URL-table where the URL can be found. Since you don't know how large the URL-table has to be, you double it's size each time it runs full (this is described in CLRS). By using amortized analysis it can be shown that this doesn't increase the running time significantly. Instead of using a list of URL's with each word, a table of URL's represented as numbers can now be used. This also requires doubling techniques and amortized analysis. For words that occur in almost all pages a third strategy can be used; namely a "bit vector". The idea is that a word occurs in the page if and only if the corresponding bit has a value 1.
  • Construct your own hashCode method. See the reference to hashing under Data structures.
  • Find other data other than homepages to search through and modify your search engine to fit your situation. For example, you could be searching for grep patterns in a text file.
  • Design a tool that, given a web page x, will search for web pages similar to x.
  • Enable the user limit the search to URLs matching a pattern such as *.dk or itu.dk/dkm/*.
  • Combine two user-imposed constraints on searching, for example limited-domain search and search for a particular prefix.
  • Support complex boolean searches, for instance ((((a AND b) OR c) AND f OR b) OR g) AND NOT h.
  • If a search produces no result, it might be because the user performed an erroneous keystroke, so that, e.g., an "a" became an "s". In such cases you can choose to return pages which contain words almost matching the word searched for.
  • Support free text searches. That is, give the opportunity to search for (fragments of) sentences. A strategy is to implement a free-text search as an AND boolean search. This limits the search to a small amount of pages which are then examined in closer detail. Another approach is to calculate hash values for sentences instead of words. A third possibility is a more compact representation of the text. Instead of having the word repeated a certain number of times, you remember at which positions in the homepage the word occurs. Find your own favorite strategy.
  • How can the data structure be updated dynamically? You can choose to collect information on a lot of new homepages before updating. Another approach is to implement dynamic data structures such as dynamic search trees. The method chosen will be a compromise between space, search time and the time for updating the data structure.
  • Implement the joint part using vectors, collections etc. and compare this (these) solution(s) to the solution from the joint part where these classes were not used.

The report

The following are general requirements on the report:
  • The report must contain a foreword telling who has fulfilled the assignment, what the assignment contains, where the assignment is made (ITU), during what period, and who the teachers are. Furthermore, a short statement on whether the project went as planned is required. The foreword should not exceed one page.
  • The document "project-agreement" is used as the cover page for the assignment.

Furthermore there are the following requirements on the report that are related to the three search engines implemented in the joint part:

  • You have to describe how the different search engines work, which data structures are used and how they are used. This description must be in a natural language and Java code may only be used to support the description.
  • You have to document your investigation of how effective the different search engines are by running them.
  • You have to compare the different implementations to each other, by using your benchmarking results from the preceding clause and/or by applying algorithm analysis. People studying algorithms are expected to use O-notation in this comparison.
  • You have to document correctness testing of the search engines. Both an internal test (also called white box/structural test) and an external test (also called black box/functional test) must be performed and documented. A note on the subject of testing is available.
  • The code for your search engines and documentation of tests etc. must be handed in as appendices to the project report. In addition, your code must be handed in on a CD or USB stick.

What the report contains apart from the joint part, depends on the work done by the group and the chosen focus of this work.

There is a note on what a project report might look like. This note is recommended for preparation for this project. In that note, an estimate of 25 pages is made for a project report. For this concrete project our estimate is less than 20 pages using a standard font (20 being a large report). Neither appendices nor figures are counted towards the 20 pages. Experience shows that the following structure makes for a good report:

  • Cover page (the "project-agreement")
  • Foreword
  • Walk-through of SearchCmd 1
  • Walk-through of SearchCmd 2
  • Walk-through of SearchCmd 3
  • Walk-through of SearchCmd 4
  • Benchmarking results for the different search engines together with theoretical explanations of the results.
  • Tests
  • Extra assignments (including the advanced part if you carry it out)
The description of SearchCmd can be presented as follows:
  • Describe the Input/Output-relation ("What does the program do?") without reference to code in such a way that it could be read by anyone.
  • Describe the data structure without reference to the specific implementation.
  • Describe ideas, invariants etc. for the code.
  • Describe in general terms and without references to the code how it is structured.
  • Describe details of the code if needed.
  • Describe how much time and space the code uses.
  • Compare to previously described search engines.

Data structures

In order to complete the assignments in the joint part you need (an understanding of) linked lists and hashing.

Linked lists are explained at the first project day. If you don't know about linked lists you can take a look at this homepage. You can also look at ListProgram.java which contains a small example of the use of linked lists.

As supplementary material on hashing we recommend this homepage.


Administrative information

Signing up for the project

N.B.: Be aware that each group must consist of three or four persons.

To sign up for the project, your group must fill in a project agreement in the project base. The process is described in the Study Guide; see in particular here and here.

You will need the following information:

  • Project title (Danish): "Søgemaskineprojekt."
  • Project title (English): "Search Engine Project."
  • Language (report): English.
  • What will be handed in: Written report. Source code on CD or USB stick.
  • Project type: "7.5 ECTS".
  • Size in ECTS: "7.50".
  • Project period: "October 18, 2009—December 16, 2009, 15.00".
  • Examination language: English.

As for the method, you can write "Programming, testing, and performance measuring using the Java programming language" or similar.

As for the problem formulation, you can use the following text as a starting point (as well as the other descriptions on this page):

The goal of this project is to implement the core functionality of a simple, text-based search engine and study its performance. See the project cluster web page: http://www.itu.dk/courses/JK01/E2009/

In the first part of the project, the following assignments must be solved:

  1. Compile and run the program SearchCmd, supplied by the supervisors.
  2. Modify SearchCmd so that if the user requests a search for a particular word (say, "Hans"), the URLs of all pages where the word "Hans" occurs are written to the screen. This has to be done by changing the search method in SearchCmd. It is not done by changing the data structure.
  3. Modify the construction of the data structure in such a way that a linked list of objects containing three fields is created. The three fields are str, next and urls. The fields str and next are like in SearchCmd, with the exception that there are only words in the list and each word only occurs once. In case an object contains the word "Hans" in the field str, the object's urls field will be a pointer to a list of URLs of all pages that contain the word "Hans". After constructing this data structure, program the corresponding search procedure(s).
  4. Modify the solution from assignment 3 so as to use a hashtable instead of a linked list of words.

In the second part of the project, the task is to <...>

Exam

When the projects are handed in an exam plan will be made ASAP.

Teachers

Annex 1: Compiling and running a java program

Here is an example of how to compile and run a java program through the command line:

$ ls
    Example.java
$ javac Example.java
$ ls
     Example.java Example.class
$ java Example argument1 argument2 argument3
     Program output
 
 
Til toppen af siden