Search Engine Project

October 2004

Introductory programming (GP),
Introduction to Programming (IPBR),
Object Oriented Programming (OOP),
Introduction to Algorithms and Data Structures (IADS).


News!

Table of contents

This page: Other:

Introductory information

Welcome to the homepage of the Search Engine Project. The Search Engine Project is a 7.5 ECTS project on implementing a high-performance search-engine (think google).

The project is suitable for students at almost any level and of almost any ambition; the only requirement is basic programming skills, as acquired in one of the courses GP, IPBR, OOP and/or IADS. Simply by choosing the right ambitions, the beginning programmer will find an opportunity to implement a manageable yet real program; whereas the advanced student will find use for all of his/her knowledge and ingenuity.

Normally, all groups are required to solve the assignments mentioned in the section Joint part below. However, special arrangements can be made. When a group has completed the joint part, they have an initial version of a search engine.

After the joint part the group can move in one of several directions, depending on interest and skills. Below is a non-exhaustive list of suggestions. Students with only basic programming skills need not do more than the joint part; Groups with greater experience are expected to advance beyond the joint part.

Each group are required to do a project agreement. Each group will have to describe its work done in a report, and this report is defended at a project exam.

Project days are held during the project period; general information regarding the project will be dispensed and questions can be asked. Furthermore, meetings with each of the individual groups will be held. If you or your group has any problems during the project you are welcome to take up contact with one of the teachers.

At the first project day, the project will be presented, and the programming techniques required for solving the joint part will be presented. Therefore, in the description of the assignment below, the presence of unfamiliar terms should not discourage you. Parts of the text will be especially incomprehensible if you are unfamiliar with linked lists; linked lists are one of the programming techniques discussed at the first project day.

The first project day is monday 22/11, room 3A14.


The assignment

Data files

Several data files containing (parts of) ITU webpages are available for use as field data for the project. The following example illustrates the file format:

*PAGE:http://www.itu.dk/index.html
Here
is
a
word
and
one
word
more
*PAGE:http://www.itu.dk/anotherpage.html
Here
is
more
and
yet
more

The idea is that a data file contains information about a series of homepages. The information for a homepage starts with a *PAGE line which denotes at which address the page can be found. Such an address is called a URL. The list of words contained in the page follows the URL. The words are listed one word per line in the same order as they occur on the page. As seen in the example, each word can have several occurences on the same page.

You can get the following datafiles where the first two contains part of the homepages of ITU and the last one contains all of them:

The program SearchCmd

As part of the background material, the program SearchCmd.java is available. It implements a small search engine. The program takes the name of the data file used for searching as a parameter; refer to the section On the use of Java below for information on passing parameters to Java programs.

In the program, the given parameter is accessible in args[0] where args is the argument to the main method. The program then constructs a linked list (see data structures) with all the lines from the data file given in args[0]. The objects in the linked list contain the fields str and next. The field str contains a line from the data file and next points to the next object in the list. After reading the data file, a simple user interface launches, enabling the user to query the search engine: Given a word, the engine informs the user whether or not the word was present in the input.

The joint part

The joint part consists of solving the following 4 assignments. (Assignment 3 is by far the largest.) These assignments are presented at the first project day.
  1. Run the program SearchCmd.
  2. Modify SearchCmd so that if the user requests a search for a particular word (say, "Hans"), the URLs of all pages where the word "Hans" occurs are written to the screen. This has to be done by changing the search method in SearchCmd. It is not done by changing the data structure.
  3. Modify the construction of the data structure in such a way that a linked list of objects containing three fields is created. The three fields are str, next and urls. The fields str and next are like in SearchCmd, with the exception that there are only words in the list and each word only occurs once. In case an object contains the word "Hans" in the field str, the object's urls field will be a pointer to a list of URLs of all pages that contain the word "Hans". A list of URLs consists of objects with two fields; next and url. The field next is a pointer to the next object in the list and the field url is (a pointer to) a URL. After construction of this data structure, you have to program the corresponding search procedure(s). The data structure created in this assignment has smaller size than the structure from SearchCmdand is faster to search through.
  4. Modify the solution from assignment 3 so as to use a hashtable (see data structures) instead of a linked list of words. You can create the data structure using chained hashing. This means that each entry contains a reference to a linked list of URLs. The use of such a data structure will have a dramatic effect on the time it takes to build and the speed of search through the data structure.

In the solution of the above assignments it is not allowed to use any packages other than java.io and java.lang. As a consequence it is not allowed to use any of the classes from the java.util package: One of the purposes of this project is to learn how to program various data structures yourself, as opposed to just using data structures programmed by others.

As you can see, you are allowed to use different methods on String objects. It is also allowed to use the method hashCode and to allocate arrays. If you have any doubt whether some Java functionality is allowed for use or not, contact one of the teachers.

For the assignments after the joint part, there are no restrictions on the Java functionality you may use.

The report

The following are general requirements on the report:

Furthermore there are the following requirements on the report that are related to the three search engines implemented in the joint part:

What the report contains apart from the joint part, depends on the work done by the group and the chosen focus of this work.

There is a note on what a project report might look like. This note is recommended for preparation for this project. In that note, an estimate of 25 pages is made for a project report. For this concrete project our estimate is less than 20 pages using a standard font (20 being a large report). Neither appendices nor figures are counted towards the 20 pages. Our experience shows that the following structure makes for a good report:

The description of SearchCmd can be presented as follows: If you find it relevant you can have a chapter after the foreword, describing linked lists, general requirements on the input etc. which applies to all of the search engines described.

Data structures

In order to complete the assignments in the joint part you need (an understanding of) linked lists and hashing.

Linked lists are explained at the first project day. If you don't know about linked lists you can refer to "Java, Java, Java: object oriented problem solving" by Ralph Morelli, second edition, chapter 16.1 - 16.2. Alternatively take a look at this homepage. You can also get ListProgram.java which contain a small example of the use of linked lists. You can also learn about linked lists by experimenting with and modifying this program.

As supplementary material on hashing we recommend this homepage.

Extra assignments

The following is suggestions for assignments to do after the joint part. The ordering of these suggestions is random.


On the use of Java

In this part you can find some comments on the use of Java. For further and more basic information refer to Introductory programming and Java API specification.

When running SearchCmd, one may need to pass parameters to the main method. Furthermore it is necessary to use so-called jar-files for many of the extra assignments. The following section remind us how to do that in popular Java development environments.

Generally, jar-files contain functionality supplementary to that given by the standard implementation. A jar-file is necessary both when compiling and executing a program that uses that extra functionality. All the jar-files referred to on this homepage can also be found in the folder:

  s:\sysadm

BlueJ

In BlueJ a program is executed by running the main method. A dialogue box appears where you have to provide the program parameters. Here for example you can write (note the use of double backslash):

  {"s:\\sysadm\\soegemaskine\\itcwww-small.txt"}

meaning that you pass the above string to the program (i.e. it is placed in args[0]).

In BlueJ you can use .jar files by selecting Options -> Preferences ->Libraries. From here click Add and find the jar-file you need.

JCreator

If you use JCreator to compile and execute Java programs and encounter problems compiling and running a program, enter the File menu and choose Close workspace each time JCreator is started.

You can pass parameters to a program's main method as follows: Choose Configure -> Options -> JDK Tools. In the field Select tools type select the value Run application and press <default>. Then press Edit and choose Parameters. Parameters can then be entered in the field Main (...).

You an use a jar-file by selecting Configure in the menu, then select Options and finally JDK Profiles. Choose JDK version 1.3 (or higher) and Edit. Press Add and choose Add package providing the jar-file to use.

Command line tools

The classical way of compiling and running Java programs is by using command line programs. You can compile the program SearchCmd.java by typing

  javac SearchCmd.java

Then run the program using the file itcwww-small.txt:

  java SearchCmd s:\sysadm\soegemaskine\itcwww-small.txt

If you want to use a jar-file, add it to the CLASSPATH. This is already done for the jar-files referred to on this page.


Administrative information

Important dates

Signing up for the project

Be aware that at least two participants are required in a group unless agreed otherwise with the teacher. We recommend at least 3 persons in a group.

This section describes in detail how to sign up for the project. These instructions pertain to the English version of the system. Complete the following steps:

  1. Go to the electronic project base https://adm.itu.dk/. If you do not have a password get one by clicking on the "get password" link. Follow the instructions and await the password in your e-mail.
  2. After logging in click on Project Base.
  3. Click on New Project.
  4. In the field Project agreement- common part, click on fill in and fill in the fields as follows:
  5. Click Save project agreement (common part) and subsequently click approve in the project agreement - common part field.
  6. In the field Participants - individual information, fill in the fields as follows:
  7. Click Save individual information and subsequently click approve under your individual information.
  8. Add each member of the group by entering his/her name, clicking add participant and filling out the form as described in 6.
  9. Make sure you invite all three teachers as supervisors: B. Biering 50%, S. Debois 50% and P. Bille 0%. (There are complicated administrative reasons for the latter.)
  10. The project agreement, if approved, will be returned to you by us shortly afterwards. As soon as you recieve it, follow the instructions to send it to the Study Board. Note that you have to send it to the Study Board before November 15.

Project exam

When the projects are handed in an exam plan will be made ASAP.

Each group must defend their report at an oral examination. Based on the report an the oral examiation a character based on the 13-scale is given for each participant in the group.

Project Days

During the project period there will be a number of days with lectures and discussion of topics in the projects. Note that the ending time of each lecture/discussion may vary depending on the need for discussion and questions posed by the audience.

Project groups

Group 1: Christian Buch Iversen, Oxana Babikova & Stephan Spangenberg. Supervisor:Søren Debois.

Group 2: Ulrich Haslund, Morten Nordholt Andersen & Veronika Capskaja. Supervisor: Bodil Biering.

Group 3: Morten Hjorth Fæster & Jesper Mouritzen. Supervisor: Søren Debois.

Group 4: Harald Kvisli. Supervisor: Søren Debois.

Group 5: Ida Mia Olsen & Louise Ege. Supervisor: Bodil Biering.

Group 6: Mette Neel Jellum, Kasper Vaala & Eirikur Alfred Eli Hedinsson Mortensen. Supervisor: Bodil Biering.

All groups.

Teachers

Bodil Biering, Søren Debois, and Philip Bille.


updated November 20, 2004; debois.