Search Engine project for
Object Oriented Programming,
Introduction to Algorithms and Data Structures
The danish version of this page can be found here(will
not be updated!!!!)
Table of contents
Welcome to the homepage of the Search Engine Project.
The Search Engine Project is for students with different skills and
goals. All groups are required to solve the assignments mentioned
in the section The joint part below. Special arrangements
regarding the Joint part can be done with the teacher. When a group has solved
the joint part, they have the first primitive version of a search engine for IT-C.
After the joint part the group can move on
in several directions. For instance they can do one or more of the suggested
extra assignments. What each group choose to work on
will depend on interest and skills. In case the group has no other skills than
introductory programming it is not expected to do more than
the joint part during the project weeks. Groups with more skills are expected
to hand in more than the joint part.
Each group are required to do a project agreement.
Each group will have to describe its work done in a
report, and this report is defended at
a project exam.
During the project project days are held, where general information
regarding the project will be given and questions can be asked. Furthermore meetings with each of
the individual groups and the instructors will be held. If you or your group has any problems during
the project you are welcome to take contact to one of the teachers,
assigned for the course. There will also be an instructor for helping with programming problems
or other problems of a more specific character.
At the first project day the project is presented and it will be possible to create the final
groups, if this is not done already. Furthermore the programming techniques, which are needed for
solving the joint part are presented. Therefore in the describtion below of the assignment
there will supposedly be words you don't know before the first project day has been held. Parts
of the text will be especially incomprehensible if you haven't heard about linked lists. Linked
lists are one of the programming techniques presented at the first project day.
To use for the project different data files containing the whole or parts of the homepage of
IT-C is available. The format of the files can be illustrated by the following example:
The idea is, that a data file contains information about a series of homepages.
The information for a homepage starts with a *PAGE line which denotes
at which address the the page can be found. Such an address is called a URL. A
list of the words contained in the page follows the URL. The words are just listed
one word at a time on each line in the same order as they occur on the page. As seen
in the example each word can have several occurences on the same page.
You can get the following datafiles where the first two contains part of the homepages of
IT-C and the last one contains all of them:
These files can also be found in the folder:
and it is recommended to use them directly from there when on IT-C.
When working using Linux the files can be found in the folder /import/share/stud/sysadm/soegemaskine/
As part of this assignment the program
SearchCmd.java is handed out. It
implements a little search engine. From the command promt the name of the
data file to use for searching is provided as a
parameter to the program and the program is run. Refer to
about the use of Java below for information on how to
pass parameters to Java programs.
The parameter passed to the program is put in args
where args is the parameter to the programs main method.
The first thing the program does is to construct a linked list
datastructures with all the lines from the
data file which name is in args.
The objects in the linked list contains the fields str
and next. The field str contains a line from the data file
and next points to the next object in the list.
After reading the data file a simple user interface is started, enabling the user
to get an answer on whether a given word is in one of the homepages read from the
The joint part consists of solving the following 4 assignments, where assignment 3
is the largest. These assignments are presented at the first
- Run the program SearchCmd.
- Modify SearchCmd in such a way that if a search for "Hans" for instance occurs,
all of the URL's where "Hans" exists is written out to the screen. This has to be done
by changing the search-method in SearchCmd. It is not done by changing the
- Modify the construction of the datastructure in such a way that a linked list of
objects containing three fields is created. The three fields are str, next
and urls. The fields str and next are like in SearchCmd,
with the exception that there are only words in the list and each word only occurs once.
In case an object contains the word "Hans" in the field str the objects urls
field will be a pointer to a list of the URL's, wihch contains the word "Hans". A list of URL's
consists of objects with two fields; next and url. The field next is
a pointer to the next object in the list and the field url is a pointer to a URL.
After construction of this datastructure coresponding search routines has to be made.
The datastructure created in this assignment is of less size than the structure from SearchCmd
and is faster to search in.
- Modify the solution from assignment 3 in such a way that you in stead of having a linked list of
the different words, store them in a hashtable datastructures. You have to
create the datastructure using chained hashing. That mean that each entry contains a reference to a linked list.
Use of such a datastructure will have a dramatic effect on the time it takes to build and search in the datastructure.
In the solution of the above assignments it is not allowed to use any
other packages than java.io and java.lang.
As a consequence it is not allowed to use any of the classes from the
The reason for this demand is, that one of the purposes with this project is to
learn how to program different datastructures and not just to learn how to use
datastructures programmed by others.
As you can see it is allowed to use different methods on String
objects. It is also allowed to use the method hashCode. It is also
allowed to allocate arrays. If you have any doubt whether some Java functionality is
allowed for use, contact one of the teachers.
For the assignments after the joint part there are no restrictions on the
java functionality you must use.
There are the following general demants for the report:
The report must contain a foreword telling who has made the assignment,
what the assignment contains, where the assignment is made (IT-C), during
wich period and with wich instructors and main instructor. Futhermore a short
statement on whether the project went as planned is required. The foreword
must at most be one page.
The document "Projektaftale" is used as front page for the assignment.
Furthermore there is the following requirements to the report, which are
all related to the three search engines implemented in the joint part:
- You have to describe how the different search engines works, which datastructures
is used and how they are used. This desribtion must be in a natural language and java-code
must only be used to support the describtion.
- You have to document that you have investigated how effective the different search engines are by running them.
- You have to, for instance using the above point, compare the different implementations to each other. People studying
algorithms are expected to use O-notation in this comparison.
- If it is the first programming project the group is doing, you have to document that
you have tested the correctness of the search engines. An internal test (also called white box/structural)
has to be done. This test doesn't need to be fully documented. It is satisfactory to provide an example (for
instance an "if else"-construction. Furthermore an external test (also called black box/functional) is required. This test
must be documented completely. A note on the subject of testing has been made on
- As appendices code for the search engines and documentation for test etc. are handed in with the project report.
What the report contains apart from the joint part, depends on the work done by the group and the chosen focus on this work.
At Introductory programming there are made a note on how a project
report might look. This note is recommended reading as preparation for this course. In the note
an estimate of 25 pages are made for a project report. For this concrete project our estimate is
less than 20 pages using a standard font (20 being a large report). Appendices are not counted in
the 20 pages, neither are figures. Our experience shows that the following structure comprises a good
The desribtion of Searchmd ? can be done in the following way:
- Front page (the "Projektaftale")
- Run though of SearchCmd 1
- Run though of SearchCmd 2
- Run though of SearchCmd 3
- Run though of SearchCmd 4
- Results of time measures for the different search engines together with theoretical explanations of the results.
- Tests (not a requirement for all groups)
- Extra assignments
If you find it relevant you can have a chapter after the foreword, describing linked lists, general demands for the input etc. which
applies to all of the search engines described.
- Describe Input/Output without reference to code in such a way that it could be read by anyone.
- Describe the datastructure without reference to the specific implementation.
- Describe ideas, invariants etc. for the code.
- Describe in general terms and without references to the code how it is structured.
- Describe details in the code if needed.
- Describe how much time and space the code uses.
- Compare to previous described search engines.
In order to solve the assignments in the joint part
you need linked lists and hashing.
Linked lists are explained at the first
project day. If you don't know about linked lists
you can refer to "Java, Java, Java: object oriented problem solving" by Ralph Morelli,
second edition, chapter 16.1 - 16.2. Alternatively take a look at
You can also get ListProgram.java
which contain a small example of the use of linked lists. A real good way
to learn about linked lists is by experimenting and modifying this program.
As supplementary material on hashing we recommend the material that can be
found at this homepage.
The following is suggestions for assignments to do after
the joint part. The ordering of these suggestions
- Write a servlet. Read more about this at this homepage.
- Write a crawler. Read more about this at this homepage.
- Use the extended data file which can be found at this homepage.
- Investigate how to easily support prefix-search. For example a search for "house*" will find all pages with
"house", "houses", "houseboats" and so on.
This only requires an extra table of pointers to the words, where the table is ordered by words.
- Create a graphical userinterface.
- Make the program more efficient in terms of usage of space (mostly for people studying
Introduction to algorithms and datastructures).
Idea: Instead of having pointers to URL-strings, you can put the diferent URL-strings in a table. The pointers can now be
replaced by a number corresponding to the entry in the URL-table where the URL can be found. Since you don't know how large the
URL-table has to be, you double it's size each time it runs full (this is described in CLRS). By using amortized analysis it can be
shown that this doesn't increase the running time significantly. Instead of using a list of URL's with each word, a table of URL's
represented as numbers can now be used. This again requires doubling techniques and amortized analysis. For words that is almost at
all pages a third strategy can be used; namely a "bitvector". The idea is that if and only if a bit is set, the word occurs at the
- Implement boolean search by supporting AND, OR etc. That is
"Hans" AND "Grethe" returns the pages where
"Hans" and "Grethe" occurs.
- Construct your own hashCode method. See the reference to hashing under Datastructures.
- Find other data to search among than homepages and modify your search engine to fit on this.
- Find similar homepages.
- Let the user limit the search only to be at some given URL's, for example search for
*.dk, eller it-c.dk/dkm/*
- Combine two user limitations on searching for example limited domainsearch and search for prefix.
- Support complex boolean searches, for instance
(((a AND b) AND c) AND f OR b) OR g.
- If a search has on result it mihgt be because the user has made a simple error stroke, in such
a way that an "a" became an "s". In such cases you can choose to return pages, which contains words
almost matching the word searched for.
- Make the personalized search engine. A user can register as a unique user.
The search engine then remembers, what kind of information the user is searching for
and uses. This information is then used to select the pages with the most relevant information.
For example Susanne from SONOFON wants pages containing information on telephones when searching
for the word "vibrator", while ...
- Support free text searches. That is give the opportunity to search on sentences. A strategy is
to implement a freetext search as an AND boolean search. This limits the search to a small amount of
pages which are then run through. Another approach is to calculate hasvalues for sentences instead of
words. A third possibility is a more compact representation of the text. Instead of having the word repeated
a certain number of times, you remember at which positions in the homepage the word occurs. Find your own
- Create a search engine that has child protection. You must be able to identify whether the user is
a child or not and to decide whether some page is for children or not. The first part can be solved
using passwords or by testing the user, ... The other part can be solved using approval by adults,
automated control etc.
- How can the datastructure be updated dynamically? You can choose to collect information
on a lot of new homepages before updating. Another approach is to implement dynamic datastructures
such as dynamic search trees. The last mentioned approach is mostly for people studying
"Introduction to Algorithms and Datastructures". The method choosen will be a compromise between
space, searchtime and time for building/updating the datastructure.
- Implement the joint part using vector, collections and so on and compare
this solution to the solution from the joint part where these classes weren't used.
- Find your own challenge.
In this part you can find some usefull information on the use of Java.
For further and more basic information refer to Introductory programming.
Furthermore there is a link to the Java API specification.
When running SearchCmd one
needs to pas parameters to the main method. Furthermore
it is necessary to use so-called jar-files for many of the extra assignments.
In the following section it is described how to do these two things in different
Java developing environments.
In general it can be mentioned that jar-files contains functionality
apart from the one given by the standard implementation. A jar file is
necessary both when compiling and executing a program that use the extra functionality.
All the jar-files that are refered to on this homepage can also be found in the folder:
In BlueJ a program is executed by running the main method. A dialogobx appears where you have
to provide the program parameters. Here for example you can write (beware of the use of double backslash):
meaning that you pass the above string to the program (i.e. it is placed in args
In BlueJ you can use .jar files by selecting Options -> Preferences ->Libraries.
From here click Add and find the correct jar-file that is needed.
If you use JCreator to translate and execute Java programs and encounters prooblems
compiling and running the programs, enter the File menu and choose Close workspace
each time JCreator is started.
You can pass parameters to a programs main-method in the following way:
Choose Configure -> Options -> JDK Tools. In the field Select tools type select the value
Run application and press <default>. Then press Edit and choose Parameters.
Parameters can then be entered in the field Main (...).
You an use a jar-file by selecting Configure in the menu, then select Options
and finally JDK Profiles.
Choose JDK version 1.3 and Edit. Press Add and choose Add package providing the jar-file
The classical way of compiling and running java programs is by using the so-called commandline programs.
You can translate the program in SearchCmd.java in the following way:
Then running the program, using the small file, in the following way:
java SearchCmd s:\sysadm\soegemaskine\itcwww-small.txt
If you want to use a jar-file it is done by adding it to the CLASSPATH.
This is already done for the jar-files refered from this homepage.
- 24. of April at 11.00-14.00 in room 1.50. Presentation of the project, help
on forming groups and filling in the project agreement.
- 28. of April: Last day for sending the project agreement to the study board.
- 5. of May, at 9.00 till 12.00. First project day
- 30. of May, last day for handing in the report.
Signing up for the project
The easiest way to sign up for the project is to go to the presentation of
the project on 24. of april, where you can get help to find a group and fill
out the project agreement. Be aware that at least two participants are required
in a group unless otherwise is agreed on with the teacher. Our recommandation
is at least 3 persons in a group.
This section describes in detail how to sign up for the project. Do the following
- Go to the electronic project base https://adm.it-c.dk/ucs/pb.
If you do not have a password get one by clicking on the "get password"
link. Follow the instructions and await the password in your e-mail.
- Click on "nyt projekt".
- In the field "Projektaftale - f´lles del" click "udfyld"
and fill in the fields as follows:
- Projekttitel (dansk): "Søgemaskineprojekt."
- Projekttitel (engelsk): "Search Engine Project."
- Deltagerhæftelse: "kollektiv."
- Sprog (rapport): <Select the language that you prefer for you report.>
- Problemformulering: "See the homepage for the search engine project."
- Metode: "See the homepage for the search engine project."
- Hvad afleveres: "See the homepage for the search engine project."
- Click "Gem projektaftale (f´lles del)" and click "godkend"
in the projektaftale field.
- In the field "Deltagere - individuelle oplysninger" fill in the
fields as follows:
- Projekttype: "4 ugers projekt".
- Omfang i ECTS: "7.50".
- Projektperiode: "05/05-2003 -- Fredag 30. maj 2003 inden kl. 12.00".
- Eksamenssprog: <Select the language that you prefer for the exam.>
- Fagområder: <If you are a SWU-student and have not met the
"Softwareudvikling"-demand click this field.>
- Forudsætninger: <Write the technical courses that you have
- Click "Gem individuelle oplysninger".
- Add each menber of the group by clicking "tilføj deltager"
and filling out the form as described in 5.
- In the field "Vejledere" enter Philip Bille, click "tilføj
vejleder", select "V´lg" and click "tilføj
vejleder". Repeat these steps with the name Nina Bohr. This should give
you Philip Bille as "Hovedvejleder" and Nina Bohr as "Bivejleder".
Your group will be assigned a specific teacher later on. If you have special
wishes please let us know.
- Finally, click "inviter" on both Philip Bille and Nina Bohr.
- The project agreement will be sent back to you ASAP by the teachers. As
soon as you recieve it follow the instructions to sent to the study board.
Note that you have to sent it to the study board before 28. april.
The time for examination will be given later.
When the projects are handed in an exam plan will be made ASAP
Each group must defend their report at an oral examination.
Based on the report an the oral examiation a character based on the 13-scale is given for each participant in the group.
- 1- person groups does a 10 minute presentation, followed by a 10-15 minute examination, summing to 20-25 minutes.
- 2- person groups does a 15 minute presentation, followed by a 10-15 minute examination, summing to 25-30 minutes.
- (3-4)- person groups does a 20 minute presentation, followed by a 15-20 minute examination, summing to 35-40 minutes.
During the project period there will be a number of days with lectures and
discussion of most of the topics the in projects.
- Monday the 5. of May, 10.00-12.00, room 1.60:
First project day. Linked lists and run through of assignment 3.
- Tuesday the 6. of May, 10.00-12.00, room 1.60:
Run through of assignment 4.
- Monday the 12. of May, 10.00-11.00, room 1.60:
Run through of assignment 4.
- Tuesday the 13. of May, 10.00-11.00, room 1.60:
Discussion about extra assignments.
- Monday the 19. of May, 10.00-1200, rooms 1.60:
Run through of assignment 4 and discussion about extra assignments.
Be aware that at least two participants are required in a group unless otherwise
is agreed on with the teacher. Our recommandation is at least 3 persons in a
group. The following groups have so far been created.
1 : Lise Gregersen (IADS, OOP, DBS), Tine Høiris Bak, Erik Matthias
2: Saba Jabeen (GP, ITPO, DBD), Susanne Hørby Christensen (IPBR,
ITPO, DBS), Sune Gregor Steenskjold (GP, DBD), Hesham Martin Khiery Mahoud El
Group 3: Kun Gao
(IADS, DS), Catalin Gabriel Oprea (OOP).
4: Laila Elisabeth Wærness (GP), Susanne Terkildsen (GP), Sanne Nygaard,
Marnie Mu-Tan Lai.
Group 5: Ulrik
Duerlund Olesen (IAD), Bue Pedersen.
Group 6: René Lund
Jensen (IAD, DS, PS), Wala Hussein Hamoud Alshahayib.
7: Mica Roswall Pedersen (GP), Rikke Elisabeth Scheibel, Andreas Schroll-Fleischer.
8: Torsten Skjødt (NP, IPBR), Jesper Bugge Jensen (GP), Martin Olsen
(GP, DBS), Benjamin Quorning (IPBR, IADS).
Groups 1,3,5,6,8 : will be supervised by Philip Bille.
Groups 2,4,7 : will be supervised by Nina Bohr.
teachers and students
Philip Bille (Teacher), email:email@example.com
Nina Bohr (Teacher), email:firstname.lastname@example.org
Jacob Borella (Teaching assistant), email: email@example.com
The teaching assistant is in room 3.08 or in the terminal rooms. He can be
found at the following times (changes may occur, but will be given at least
24 hours in advance):