Search Engine project for
Introductory programming,
Object Oriented Programming,
Database Systems,
Concurrency and
Introduction to Algorithms and Data Structures
December 2003
The danish version of this page can be found here(will
not be updated!!!!)
Table of contents
This page:
Other:
Welcome to the homepage of the Search Engine Project.
The Search Engine Project is for students with different skills and
goals. All groups are required to solve the assignments mentioned
in the section The joint part below. Special arrangements
regarding the Joint part can be done with the teacher. When a group has solved
the joint part, they have the first primitive version of a search engine for IT-C.
After the joint part the group can move on
in several directions. For instance they can do one or more of the suggested
extra assignments. What each group choose to work on
will depend on interest and skills. In case the group has no other skills than
introductory programming it is not expected to do more than
the joint part during the project weeks. Groups with more skills are expected
to hand in more than the joint part.
Each group are required to do a project agreement.
Each group will have to describe its work done in a
report, and this report is defended at
a project exam.
During the project project days are held, where general information
regarding the project will be given and questions can be asked. Furthermore meetings with each of
the individual groups and the instructors will be held. If you or your group has any problems during
the project you are welcome to take contact to one of the teachers,
assigned for the course. There will also be an instructor for helping with programming problems
or other problems of a more specific character.
At the first project day the project is presented and it will be possible to create the final
groups, if this is not done already. Furthermore the programming techniques, which are needed for
solving the joint part are presented. Therefore in the describtion below of the assignment
there will supposedly be words you don't know before the first project day has been held. Parts
of the text will be especially incomprehensible if you haven't heard about linked lists. Linked
lists are one of the programming techniques presented at the first project day.
To use for the project different data files containing the whole or parts of the homepage of
IT-C is available. The format of the files can be illustrated by the following example:
*PAGE:http://www.it-c.dk
Here
is
a
word
and
one
word
more
*PAGE:http://www.it-c.dk/anotherpage.html
Here
is
more
and
yet
more
The idea is, that a data file contains information about a series of homepages.
The information for a homepage starts with a *PAGE line which denotes
at which address the the page can be found. Such an address is called a URL. A
list of the words contained in the page follows the URL. The words are just listed
one word at a time on each line in the same order as they occur on the page. As seen
in the example each word can have several occurences on the same page.
You can get the following datafiles where the first two contains part of the homepages of
IT-C and the last one contains all of them:
These files can also be found in the folder:
s:\sysadm\soegemaskine
and it is recommended to use them directly from there when on IT-C.
When working using Linux the files can be found in the folder /import/share/stud/sysadm/soegemaskine/
As part of this assignment the program
SearchCmd.java is handed out. It
implements a little search engine. From the command promt the name of the
data file to use for searching is provided as a
parameter to the program and the program is run. Refer to
about the use of Java below for information on how to
pass parameters to Java programs.
The parameter passed to the program is put in args[0]
where args is the parameter to the programs main method.
The first thing the program does is to construct a linked list
datastructures with all the lines from the
data file which name is in args[0].
The objects in the linked list contains the fields str
and next. The field str contains a line from the data file
and next points to the next object in the list.
After reading the data file a simple user interface is started, enabling the user
to get an answer on whether a given word is in one of the homepages read from the
data file.
The joint part consists of solving the following 4 assignments, where assignment 3
is the largest. These assignments are presented at the first
project day.
- Run the program SearchCmd.
- Modify SearchCmd in such a way that if a search for "Hans" for instance occurs,
all of the URL's where "Hans" exists is written out to the screen. This has to be done
by changing the search-method in SearchCmd. It is not done by changing the
datastructure.
- Modify the construction of the datastructure in such a way that a linked
list of objects containing three fields is created. The three fields are str,
next and urls. The fields str and next
are like in SearchCmd, with the exception that there are only words
in the list and each word only occurs once. In case an object contains the
word "Hans" in the field str the objects urls field will
be a pointer to a list of the URL's, wihch contains the word "Hans". A list
of URL's consists of objects with two fields; next and url.
The field next is a pointer to the next object in the list and the
field url is a pointer to a URL. After construction of this datastructure
coresponding search routines has to be made. The datastructure created in
this assignment is of less size than the structure from Searchcmdand
is faster to search in.
- Modify the solution from assignment 3 in such a way that you in stead of having a linked list of
the different words, store them in a hashtable datastructures. You have to
create the datastructure using chained hashing. That mean that each entry contains a reference to a linked list.
Use of such a datastructure will have a dramatic effect on the time it takes to build and search in the datastructure.
In the solution of the above assignments it is not allowed to use any
other packages than java.io and java.lang.
As a consequence it is not allowed to use any of the classes from the
java.util package.
The reason for this demand is, that one of the purposes with this project is to
learn how to program different datastructures and not just to learn how to use
datastructures programmed by others.
As you can see it is allowed to use different methods on String
objects. It is also allowed to use the method hashCode. It is also
allowed to allocate arrays. If you have any doubt whether some Java functionality is
allowed for use, contact one of the teachers.
For the assignments after the joint part there are no restrictions on the
java functionality you must use.
There are the following general demants for the report:
-
The report must contain a foreword telling who has made the assignment,
what the assignment contains, where the assignment is made (IT-C), during
wich period and with wich instructors and main instructor. Futhermore a short
statement on whether the project went as planned is required. The foreword
must at most be one page.
-
The document "Projektaftale" is used as front page for the assignment.
Furthermore there is the following requirements to the report, which are
all related to the three search engines implemented in the joint part:
- You have to describe how the different search engines works, which datastructures
is used and how they are used. This desribtion must be in a natural language and java-code
must only be used to support the describtion.
- You have to document that you have investigated how effective the different search engines are by running them.
- You have to, for instance using the above point, compare the different implementations to each other. People studying
algorithms are expected to use O-notation in this comparison.
- If it is the first programming project the group is doing, you have to document that
you have tested the correctness of the search engines. An internal test (also called white box/structural)
has to be done. This test doesn't need to be fully documented. It is satisfactory to provide an example (for
instance an "if else"-construction. Furthermore an external test (also called black box/functional) is required. This test
must be documented completely. A note on the subject of testing has been made on
Introductory programming.
- As appendices code for the search engines and documentation for test etc. are handed in with the project report.
What the report contains apart from the joint part, depends on the work done by the group and the chosen focus on this work.
At Introductory programming there are made a note on how a project
report might look. This note is recommended reading as preparation for this course. In the note
an estimate of 25 pages are made for a project report. For this concrete project our estimate is
less than 20 pages using a standard font (20 being a large report). Appendices are not counted in
the 20 pages, neither are figures. Our experience shows that the following structure comprises a good
report:
- Front page (the "Projektaftale")
- Foreword
- Run though of SearchCmd 1
- Run though of SearchCmd 2
- Run though of SearchCmd 3
- Run though of SearchCmd 4
- Results of time measures for the different search engines together with theoretical explanations of the results.
- Tests (not a requirement for all groups)
- Extra assignments
The desribtion of Searchmd ? can be done in the following way:
- Describe Input/Output without reference to code in such a way that it could be read by anyone.
- Describe the datastructure without reference to the specific implementation.
- Describe ideas, invariants etc. for the code.
- Describe in general terms and without references to the code how it is structured.
- Describe details in the code if needed.
- Describe how much time and space the code uses.
- Compare to previous described search engines.
If you find it relevant you can have a chapter after the foreword, describing linked lists, general demands for the input etc. which
applies to all of the search engines described.
In order to solve the assignments in the joint part
you need linked lists and hashing.
Linked lists are explained at the first
project day. If you don't know about linked lists
you can refer to "Java, Java, Java: object oriented problem solving" by Ralph Morelli,
second edition, chapter 16.1 - 16.2. Alternatively take a look at
this homepage.
You can also get ListProgram.java
which contain a small example of the use of linked lists. A real good way
to learn about linked lists is by experimenting and modifying this program.
As supplementary material on hashing we recommend the material that can be
found at this homepage.
The following is suggestions for assignments to do after
the joint part. The ordering of these suggestions
is random.
- Write a servlet. Read more about this at this homepage.
- Write a crawler. Read more about this at this homepage.
- Use the extended data file which can be found at this homepage.
- Investigate how to easily support prefix-search. For example a search for "house*" will find all pages with
"house", "houses", "houseboats" and so on.
This only requires an extra table of pointers to the words, where the table is ordered by words.
- Create a graphical userinterface.
- Make the program more efficient in terms of usage of space (mostly for people studying
Introduction to algorithms and datastructures).
Idea: Instead of having pointers to URL-strings, you can put the diferent URL-strings in a table. The pointers can now be
replaced by a number corresponding to the entry in the URL-table where the URL can be found. Since you don't know how large the
URL-table has to be, you double it's size each time it runs full (this is described in CLRS). By using amortized analysis it can be
shown that this doesn't increase the running time significantly. Instead of using a list of URL's with each word, a table of URL's
represented as numbers can now be used. This again requires doubling techniques and amortized analysis. For words that is almost at
all pages a third strategy can be used; namely a "bitvector". The idea is that if and only if a bit is set, the word occurs at the
homepage.
- Implement boolean search by supporting AND, OR etc. That is
"Hans" AND "Grethe" returns the pages where
"Hans" and "Grethe" occurs.
- Construct your own hashCode method. See the reference to hashing under Datastructures.
- Find other data to search among than homepages and modify your search engine to fit on this.
- Find similar homepages.
- Let the user limit the search only to be at some given URL's, for example search for
*.dk, eller it-c.dk/dkm/*
- Combine two user limitations on searching for example limited domainsearch and search for prefix.
- Support complex boolean searches, for instance
(((a AND b) AND c) AND f OR b) OR g.
- If a search has on result it mihgt be because the user has made a simple error stroke, in such
a way that an "a" became an "s". In such cases you can choose to return pages, which contains words
almost matching the word searched for.
- Make the personalized search engine. A user can register as a unique user.
The search engine then remembers, what kind of information the user is searching for
and uses. This information is then used to select the pages with the most relevant information.
For example Susanne from SONOFON wants pages containing information on telephones when searching
for the word "vibrator", while ...
- Support free text searches. That is give the opportunity to search on sentences. A strategy is
to implement a freetext search as an AND boolean search. This limits the search to a small amount of
pages which are then run through. Another approach is to calculate hasvalues for sentences instead of
words. A third possibility is a more compact representation of the text. Instead of having the word repeated
a certain number of times, you remember at which positions in the homepage the word occurs. Find your own
favorite strategy.
- Create a search engine that has child protection. You must be able to identify whether the user is
a child or not and to decide whether some page is for children or not. The first part can be solved
using passwords or by testing the user, ... The other part can be solved using approval by adults,
automated control etc.
- How can the datastructure be updated dynamically? You can choose to collect information
on a lot of new homepages before updating. Another approach is to implement dynamic datastructures
such as dynamic search trees. The last mentioned approach is mostly for people studying
"Introduction to Algorithms and Datastructures". The method choosen will be a compromise between
space, searchtime and time for building/updating the datastructure.
- Implement the joint part using vector, collections and so on and compare
this solution to the solution from the joint part where these classes weren't used.
- Find your own challenge.
In this part you can find some usefull information on the use of Java.
For further and more basic information refer to Introductory programming.
Furthermore there is a link to the Java API specification.
When running SearchCmd one
needs to pas parameters to the main method. Furthermore
it is necessary to use so-called jar-files for many of the extra assignments.
In the following section it is described how to do these two things in different
Java developing environments.
In general it can be mentioned that jar-files contains functionality
apart from the one given by the standard implementation. A jar file is
necessary both when compiling and executing a program that use the extra functionality.
All the jar-files that are refered to on this homepage can also be found in the folder:
s:\sysadm
BlueJ
In BlueJ a program is executed by running the main method. A dialogobx appears where you have
to provide the program parameters. Here for example you can write (beware of the use of double backslash):
{"s:\\sysadm\\soegemaskine\\itcwww-small.txt"}
meaning that you pass the above string to the program (i.e. it is placed in args[0]
In BlueJ you can use .jar files by selecting Options -> Preferences ->Libraries.
From here click Add and find the correct jar-file that is needed.
JCreator
If you use JCreator to translate and execute Java programs and encounters prooblems
compiling and running the programs, enter the File menu and choose Close workspace
each time JCreator is started.
You can pass parameters to a programs main-method in the following way:
Choose Configure -> Options -> JDK Tools. In the field Select tools type select the value
Run application and press <default>. Then press Edit and choose Parameters.
Parameters can then be entered in the field Main (...).
You an use a jar-file by selecting Configure in the menu, then select Options
and finally JDK Profiles.
Choose JDK version 1.3 and Edit. Press Add and choose Add package providing the jar-file
to use.
Commandline tools
The classical way of compiling and running java programs is by using the so-called commandline programs.
You can translate the program in SearchCmd.java in the following way:
javac SearchCmd.java
Then running the program, using the small file, in the following way:
java SearchCmd s:\sysadm\soegemaskine\itcwww-small.txt
If you want to use a jar-file it is done by adding it to the CLASSPATH.
This is already done for the jar-files refered from this homepage.
- 23. of October 16.00-19.00 in room 0.10. Poster presentation of the project.
- 10. of November 10.00-11.00 in room 0.05. Presentation of project, help
on forming groups and help on signing up for the project.
- 17. of November. Deadline for handing in the project agreement.
- 24. of November. Start of project. First project
day.
- 19. of December 12.00. Deadline for handing in the project.
-
Signing up for the project
The easiest way to sign up for the project is to go to the presentation of
the project on 10. of November, where you can get help to find a group and fill
out the project agreement. Be aware that at least two participants are required
in a group unless otherwise is agreed on with the teacher. We recommend at least
3 persons in a group.
This section describes in detail how to sign up for the project. These instruction
are given for the english version of the system. Do the following steps:
- Go to the electronic project base https://adm.it-c.dk/.
If you do not have a password get one by clicking on the "get password"
link. Follow the instructions and await the password in your e-mail.
- After logging on click on Project Base.
- Click on "new project".
- In the field "Project agreement- common part" click "fill
in " and fill in the fields as follows:
- Project title(danish): "Søgemaskineprojekt."
- Project title (english): "Search Engine Project."
- Non disclosure agreement: "No"
- Liability: "Collective."
- Language (report): <Select the language that you prefer for your
report.>
- Problemformulation: "See the homepage for the search engine project."
- Method: "See the homepage for the search engine project."
- What will be handed in: "See the homepage for the search engine
project."
- Click "Save project agreement (common part)" and subsequently
click "approve" in the project agreement - common part field.
- In the field "Participants - individual information " fill in
the fields as follows:
- Project type: "4-week project".
- Size in ECTS: "7.50".
- Project period: "2003-11-24 -- 19 December before 12.00".
- Examination language: <Select the language that you prefer for the
exam.>
- Subject areas: <Select the fields that you would like to cover.>
- Qualifications: <Write the technical courses that you have followed.
E.g. Introductory programming, IAD, OOP, Distributed Systems, etc.>
- Click "Save individual information" and subsequently click "approve"
under your individual information.
- Add each menber of the group by entering the name of him/her, clicking "add
participant" and filling out the form as described in 6.
- In the field "Supervisor" enter Philip Bille, click "add
supervisor", select "choose" to the left of Philip Bille and
click "add supervisor".
- Finally, click "invite" on Philip Bille.
- The project agreement will be sent back to you ASAP by me. As soon as you
recieve it follow the instructions to sent it to the study board. Note that
you have to sent it to the study board before 17. of November.
The time for examination will be given later.
When the projects are handed in an exam plan will be made ASAP
Exam plan
Each group must defend their report at an oral examination.
Based on the report an the oral examiation a character based on the 13-scale is given for each participant in the group.
- 1- person groups does a 10 minute presentation, followed by a 10-15 minute examination, summing to 20-25 minutes.
- 2- person groups does a 15 minute presentation, followed by a 10-15 minute examination, summing to 25-30 minutes.
- (3-4)- person groups does a 20 minute presentation, followed by a 15-20 minute examination, summing to 35-40 minutes.
During the project period there will be a number of days with lectures and
discussion of most of the topics the in projects. Note that the ending time
of each lecture/discussion is may vary depending on the need for discussion
and questions posed by the audience.
- Monday the 24. of November, 10.00-12.00, room 1.60:
First project day. Linked lists and run through of assignment 3.
- Tuesday the 25. of November, 10.00-12.00, room 1.60:
Run through of assignment 4.
- Monday the 1. of December, 10.00-12.00, room 1.60:
Run through of assignment 4.
- Tuesday the 2. of December, 10.00-12.00, room 1.60:
Discussion about extra assignments.
- Monday the 8. of December, 10.00-12.00, room 1.60:
Run through of assignment 4 and discussion about extra assignments.
Group 1 Maisa Ibrahim El Rafie
(IPBR, SU, IADS, DBS, OOP), Susanne Hoerby Christensen (IPBR, ITPO, DBS, OOP, IADS), Safuriat Oloruntoyin Johnson (IADS), Thomas Stuart Henney (GP, IADS).
Group 2 John Jensen (GP, OOP).
Group 3 Rune Hessing Russel (GP, DBD, ITPL), Louise Lahn Soerensen (GP, DBD, ITPL), Henrik Bie (GP, DBD, ITPL) Annika Hollesen Eirikstoft (GP, DBD, ITPL).
Group 4 Brigitte Susan Kammersgaard-Andersen (GP), Zhiqiang Sun (IAD), Yuqin Cao (IAD).
Group 5 Anders Lund (GP, DBD), Simon Lei Fredslund (GP, DBD, ITPO).
Group 6 Yan Yi (IPBR, IADS), Filip Skov Skjerning (GP, W2), Mehrdad Mahrouiyan (GP).
Group 7 Karin Agerbaek Jensen (GP), Helene Noergaard Bech (GP).
Group 8 Nicolai Ritz (IADS, DS), Deike Dreiseberg (OOP, IADS).
Group 9 Tru Dong Tran (IADS), Suraj Ramalingam (IADS, IPBR, NP), Priyadarsini Seetharaman (IADS, IPBR, NP).
Group 10 Jens Lykke Brandt (IADS).
Group 11 Louise Kjaer Brun Pedersen (GP), Danny Mortensen (GP).
All teachers and students
Philip Bille (Teacher), email:beetle@itu.dk
Sebastian Jensen (Teaching assistant), email:seje@itu.dk
| day |
time |
| 26/11 (Sebastian is in Linux Lab) |
9-16 |
| 28/11 (Sebastian is in Linux Lab) |
9-16 |
| 2/12 (Sebastian is in room 3.15 or Linux Lab) |
9-16 |
| 4/12 (Sebastian is in room 4.05) |
9-16 |
| 9/12 (Sebastian is in room 4.05) |
9-16 |
| 11/12(Sebastian is in room 4.05) |
9-16 |
| 16/12(Sebastian is in room 4.05) |
9-16 |
| 18/12(Sebastian is in room 4.05) |
9-16 |