Search Engine Project
December 2005
Introductory programming (GP),
Introduction to Programming (IPBR),
Object Oriented Programming (OOP),
Introduction to Algorithms and Data Structures (IADS).
Table of contents
This page:
Other:
Welcome to the homepage of the Search Engine Project. The Search
Engine Project is a 7.5 ECTS project on implementing a
high-performance search-engine (think google).
The project is suitable for students at almost any level and of
almost any ambition; the only requirement is basic programming
skills, as acquired in one of the courses GP, IPBR, OOP and/or
IADS. Simply by choosing the right ambitions, the beginning
programmer will find an opportunity to implement a manageable yet
real program; whereas the advanced student will find use for all
of his/her knowledge and ingenuity.
Normally, all groups are required to solve the assignments
mentioned in the section Joint part
below. However, special arrangements can be made. When a group
has completed the joint part, they have an initial version of a
search engine.
After the joint part the group can move
in one of several directions, depending on interest and skills. Below is a non-exhaustive list of suggestions.
Students with only basic programming skills need not do more than the joint part; Groups with greater experience
are expected to advance beyond the joint
part.
Each group are required to do a project
agreement. Each group will have to describe its work done in a report, and this report is defended at a project exam.
Project days are held during the project
period; general information regarding the project will be
dispensed and questions can be asked. Furthermore, meetings with each
of the individual groups will be held. If you or your group has any
problems during the project you are welcome to take up contact with
one of the teachers.
At the first project day, the project will be presented, and
the programming techniques required for solving the
joint part will be presented. Therefore, in the description of the assignment below, the presence of unfamiliar
terms should not discourage you. Parts of the text will be especially
incomprehensible if you are unfamiliar with linked lists; linked lists
are one of the programming techniques discussed at the first project
day.
Several data files containing (parts of) ITU webpages are available
for use as field data for the project. The following example illustrates the file format:
*PAGE:http://www.itu.dk/index.html
Here
is
a
word
and
one
word
more
*PAGE:http://www.itu.dk/anotherpage.html
Here
is
more
and
yet
more
The idea is that a data file contains information about a series of homepages.
The information for a homepage starts with a *PAGE line which denotes
at which address the page can be found. Such an address is called a URL. The
list of words contained in the page follows the URL. The words are listed
one word per line in the same order as they occur on the page. As seen
in the example, each word can have several occurences on the same page.
You can get the following datafiles where the first two contains part of the homepages of
ITU and the last one contains all of them:
As part of the background material, the program
SearchCmd.java is available. It
implements a small search engine. The program takes the name of the
data file used for searching as a
parameter; refer to the section On the use of
Java below for information on passing parameters to Java
programs.
In the program, the given parameter is accessible in
args[0] where args is the argument to the
main method. The program then constructs a linked list (see
data structures) with all the lines from
the data file given in args[0]. The
objects in the linked list contain the fields str and
next. The field str contains a line from the data
file and next points to the next object in the list. After
reading the data file, a simple user interface launches, enabling the
user to query the search engine: Given a word, the engine informs the user whether or not the word was present in the input.
The joint part consists of solving the following 4
assignments. (Assignment 3 is by far the largest.)
These assignments are presented at the first project day.
- Run the program SearchCmd.
- Modify SearchCmd so that if the user requests a search for a
particular word (say, "Hans"), the URLs of all pages where the word
"Hans" occurs are written to the screen. This has to be done
by changing the search method in SearchCmd. It is not done by changing the data structure.
- Modify the construction of the data structure in such a way that a linked
list of objects containing three fields is created. The three fields are str,
next and urls. The fields str and next
are like in SearchCmd, with the exception that there are only words
in the list and each word only occurs once. In case an object contains the
word "Hans" in the field str, the object's urls field will
be a pointer to a list of URLs of all pages that contain the word "Hans". A list
of URLs consists of objects with two fields; next and url.
The field next is a pointer to the next object in the list and the
field url is (a pointer to) a URL. After construction of this data structure, you have to program the
corresponding search procedure(s). The data structure created in
this assignment has smaller size than the structure from SearchCmdand
is faster to search through.
- Modify the solution from assignment 3 so as to use a hashtable (see data structures) instead of a linked list of words.
You can
create the data structure using chained hashing. This means that each entry contains a reference to a linked list of URLs.
The use of such a data structure will have a dramatic effect on the time it takes to build and the speed of search through the data structure.
In the solution of the above assignments it is not allowed to use any
packages other than java.io and java.lang.
As a consequence it is not allowed to use any of the classes from the
java.util package: One of the purposes of this project is to
learn how to program various data structures yourself, as opposed to just using
data structures programmed by others.
As you can see, you are allowed to use different methods on String
objects. It is also allowed to use the method hashCode and to allocate arrays. If you have any doubt whether some Java functionality is
allowed for use or not, contact one of the teachers.
For the assignments after the joint part, there are no restrictions on the
Java functionality you may use.
The following are general requirements on the report:
-
The report must contain a foreword telling who has fulfilled the assignment,
what the assignment contains, where the assignment is made (ITU), during
what period, and who the teachers are. Futhermore, a short
statement on whether the project went as planned is required. The foreword
should not exceed one page.
-
The document "Projektaftale" is used as the cover page for the assignment.
Furthermore there are the following requirements on the report that are
related to the three search engines implemented in the joint part:
- You have to describe how the different search engines work, which data structures
are used and how they are used. This description must be in a natural language and Java code
may only be used to support the description.
- You have to document your investigation of how effective the different search engines are by running them.
- You have to compare the different implementations to each other,
by using your benchmarking results from the preceding clause
and/or by applying algorithm analysis. People studying
algorithms are expected to use O-notation in this comparison.
- If it is the first programming project the group is doing, you have to document correctness testing of the search engines. An internal test (also called white box/structural)
has to be performed. This test doesn't need to be fully documented. It is sufficient to provide an example (for
instance an "if else"-construction. Furthermore an external test (also called black box/functional) is required. This test
must be documented completely. A note on the subject of testing is available.
- the code for your search engines and documentation of tests etc. are handed in as appendices to the project report.
What the report contains apart from the joint part, depends on the work done by the group and the chosen focus of this work.
There is a note on what a project
report might look like. This note is recommended for preparation for this project. In that note,
an estimate of 25 pages is made for a project report. For this concrete project our estimate is
less than 20 pages using a standard font (20 being a large report). Neither appendices nor figures are counted towards
the 20 pages. Our experience shows that the following structure makes for a good
report:
- Cover page (the "Projektaftale")
- Foreword
- Walk-through of SearchCmd 1
- Walk-through of SearchCmd 2
- Walk-through of SearchCmd 3
- Walk-through of SearchCmd 4
- Benchmarking results for the different search engines together with theoretical explanations of the results.
- Tests (not a requirement for all groups)
- Extra assignments
The description of SearchCmd can be presented as follows:
- Describe the Input/Output-relation ("What does the program do?") without reference to code in such a way that it could be read by anyone.
- Describe the data structure without reference to the specific implementation.
- Describe ideas, invariants etc. for the code.
- Describe in general terms and without references to the code how it is structured.
- Describe details of the code if needed.
- Describe how much time and space the code uses.
- Compare to previously described search engines.
If you find it relevant you can have a chapter after the foreword, describing linked lists, general requirements on the input etc. which
applies to all of the search engines described.
In order to complete the assignments in the joint part
you need (an understanding of) linked lists and hashing.
Linked lists are explained at the first
project day. If you don't know about linked lists
you can refer to "Java, Java, Java: object oriented problem solving" by Ralph Morelli,
second edition, chapter 16.1 - 16.2. Alternatively take a look at
this homepage.
You can also get ListProgram.java
which contain a small example of the use of linked lists. You can also learn about linked lists by experimenting with and modifying this program.
As supplementary material on hashing we recommend this homepage.
The following is suggestions for assignments to do after
the joint part. The ordering of these suggestions
is random.
- Write a servlet.
- Write a crawler.
- Use the extended data file.
- Investigate how to easily support prefix-search. For example a search for "arm*" should find all pages with "arms",
"armoury", "armadillo", "armageddon", "armistice" etc.
This only requires an extra table of pointers to the words, where the table is ordered by words.
- Create a graphical user interface.
- Make the program more efficient in terms of usage of space (mostly for people studying Introduction to algorithms and data structures).
Idea: Instead of having pointers to URL-strings, you can put the various URL-strings in a table. The pointers can now be
replaced by a number corresponding to the entry in the URL-table where the URL can be found. Since you don't know how large the
URL-table has to be, you double it's size each time it runs full (this is described in CLRS). By using amortized analysis it can be
shown that this doesn't increase the running time significantly. Instead of using a list of URL's with each word, a table of URL's
represented as numbers can now be used. This also requires doubling techniques and amortized analysis. For words that occur in almost
all pages a third strategy can be used; namely a "bitvector". The idea is that a word occurs in the page if and only if the corresponding bit has a value 1.
- Implement boolean search by supporting AND, OR etc. That is,
"Hans" AND "Grethe" returns the (list of) pages where both words
"Hans" and "Grethe" occur.
- Construct your own hashCode method. See the reference to hashing under Data structures.
- Find other data other than homepages
to search through and modify your search engine to fit your situation.
For example, you could be searching for grep patterns in a text file.
- Design a tool that, given a webpage x,
will search for webpages similar to x.
- Enable the user limit the search to URLs matching a pattern such as
*.dk or itu.dk/dkm/*.
- Combine two user-imposed constraints on searching, for example limited-domain search and search for a particular prefix.
- Support complex boolean searches, for instance
((((a AND b) OR c) AND f OR b) OR g) AND NOT *microsoft*.
- If a search produces no result, it might be because the user performed an erroneous keystroke, so that, e.g., an "a" became an "s". In such cases you can choose to return pages which contain words almost matching the word searched for.
- Support free text searches. That is, give the opportunity to search for (fragments of) sentences. A strategy is
to implement a freetext search as an AND boolean search. This limits the search to a small amount of
pages which are then examined in closer detail. Another approach is to calculate hashvalues for sentences instead of
words. A third possibility is a more compact representation of the text. Instead of having the word repeated
a certain number of times, you remember at which positions in the homepage the word occurs. Find your own
favorite strategy.
- How can the data structure be updated dynamically? You can choose to collect information
on a lot of new homepages before updating. Another approach is to implement dynamic data structures
such as dynamic search trees. The last mentioned approach is mostly for people studying
"Introduction to Algorithms and Data structures". The method choosen will be a compromise between
space, searchtime and the time for updating the data structure.
- Implement the joint part using vectors, collections etc. and compare
this (these) solution(s) to the solution from the joint part where these classes were not used.
- Find your own challenge.
In this part you can find some comments on the use of Java.
For further and more basic information refer to Introductory programming and Java API specification.
BlueJ
In BlueJ a program is executed by running the main method. A dialogue box appears where you have
to provide the program parameters. Here for example you can write (note the use of double backslash):
{"u:\\esben\\searchengine\\itcwww-small.txt"}
meaning that you pass the above string to the program (i.e. it is placed in args[0]).
In BlueJ you can use .jar files by selecting Options -> Preferences ->Libraries.
From here click Add and find the jar-file you need.
JCreator
If you use JCreator to compile and execute Java programs and encounter problems
compiling and running a program, enter the File menu and choose Close workspace
each time JCreator is started.
You can pass parameters to a program's main method as follows:
Choose Configure -> Options -> JDK Tools. In the field Select tools type select the value
Run application and press <default>. Then press Edit and choose Parameters.
Parameters can then be entered in the field Main (...).
You an use a jar-file by selecting Configure in the menu, then select Options
and finally JDK Profiles.
Choose JDK version 1.3 (or higher) and Edit. Press Add and choose Add package providing the jar-file
to use.
Command line tools
The classical way of compiling and running Java programs is by using command line programs.
You can compile the program SearchCmd.java by typing
javac SearchCmd.java
Then run the program using the file itcwww-small.txt:
java SearchCmd u:\esben\searchengine\itcwww-small.txt
If you want to use a jar-file, add it to the CLASSPATH.
This is already done for the jar-files referred to on this page.
Be aware that at least two participants are required
in a group unless agreed otherwise with the teacher. We recommend at least
3 persons in a group.
This section describes in detail how to sign up for the project. These instructions
pertain to the English version of the system. Complete the following steps:
- Go to the electronic project base https://adm.itu.dk/.
If you do not have a password get one by clicking on the "get password"
link. Follow the instructions and await the password in your e-mail.
- After logging in click on Project Base.
- Click on New Project.
- In the field Project agreement- common part, click on fill
in and fill in the fields as follows:
- Project title (danish): "Søgemaskineprojekt."
- Project title (english): "Search Engine Project."
- Non-disclosure agreement: "No"
- Liability: "Collective."
- Language (report): <Select the language that you prefer for your report.>
- Problem formulation: <See the homepage for the search engine project.>
- Method: <See the homepage for the search engine project.>
- What will be handed in: <See the homepage for the search engine
project.>
- Click Save project agreement (common part) and subsequently
click approve in the project agreement - common part field.
- In the field Participants - individual information, fill in
the fields as follows:
- Project type: "4-week project".
- Size in ECTS: "7.50".
- Project period: "November 28, 2005—December 23, 2005, 15.00".
- Examination language: <Select the language that you prefer for the
exam.>
- Subject areas: <Select the fields that you would like to
cover.>
- Qualifications: <Write down the technical courses that you have
followed. E.g. Introductory programming, IADS, OOP, Distributed Systems,
etc.>
- Click Save individual information and subsequently click approve under your individual information.
- Add each member of the group by entering his/her name, clicking add
participant and filling out the form as described in 6.
- Make sure you invite both teachers as supervisors: Esben Rune Hansen 50% and Milan Ruzic 50%.
- The project agreement, if approved, will be returned to you by us shortly afterwards. As soon as you
recieve it, follow the instructions to send it to the Study Board. Note that
you have to send it to the Study Board before April 25.
When the projects are handed in an exam plan will be made ASAP.
Each group must defend their report at an oral examination.
Based on the report an the oral examiation a character based on the 13-scale is given for each participant in the group.
- 1-person groups do a 10 minute presentation, followed by a 10-15 minute examination, taking up 20-25 minutes in total.
- 2-person groups do a 15 minute presentation, followed by a 10-15 minute examination, taking up 25-30 minutes in total.
- (3-4)-person groups do a 20 minute presentation, followed by a 15-20 minute examination, taking up 35-40 minutes in total.
Esben Rune Hansen
and Milan Ruzic.
updated November 2, 2005 by Esben Rune Hansen.