|
|
Project Cluster: Programming Workshop 2009
October 2009
Introductory Programming (GPN)
Object-oriented Programming (OOP)
Table of contents
This page:
Welcome to the homepage of the Search Engine Project. The
Search Engine Project is a 7.5 ECTS project on implementing the
core functionality of a simple, text-based search-engine
(think Google).
In order to account for the difference in programming experience
between the different students, the project is divided in three parts:
- The joint part should be completed by all the
students, regardless of whether they come from GPN or OOP. It consists of
creating the core of the search engine, and a fully functional program
will be obtained at the end.
- The advanced part is only mandatory for OOP
students. It will add some features and optimizations to the search
engine.
- Finally, the extensions are a list of
optional features for the engines. Students who have completed their
mandatory tasks can pick any that they find interesting and start working
on them.
The project will be carried out in groups. Each group must consist of three or four persons.
Each group is required to do a project
agreement and to describe its work in a written
report. The course ends with an oral exam.
There will be a weekly meeting, on Monday between 13:00 and 15:00. We will
first discuss general issues and then offer more individual help.
We meet in Room 4A20, except for October 26 and November 9 where we meet in Room 4A14.
At the first project day, October 19th, the project will be presented, and
the programming techniques required for solving the
joint part will be presented. Therefore, in the description of the assignment below, the presence of unfamiliar
terms should not discourage you. Parts of the text will be especially
incomprehensible if you are unfamiliar with linked lists; linked lists
are one of the programming techniques discussed at the first project
day.
Three data files containing (parts of) old ITU web pages are available
for use as field data for the project. The following example illustrates the file format:
*PAGE:http://www.itu.dk/index.html
Here
is
a
word
and
one
word
more
*PAGE:http://www.itu.dk/anotherpage.html
Here
is
more
and
yet
more
The idea is that a data file contains information about a series of web pages.
The information for a web page starts with a *PAGE line which denotes
at which address the page can be found. Such an address is called a URL. The
list of words contained in the page follows the URL. The words are listed
one word per line in the same order as they occur on the page. As seen
in the example, each word can have several occurrences on the same page.
The following three data files are available:
As part of the background material, the program
SearchCmd.java is available. It
implements a small search engine. The program takes the name of the
data file used for searching as a
parameter.
In the program, the given parameter is accessible in
args[0] where args is the argument to the
main method. The program then constructs a linked list (see
data structures) with all the lines from
the data file given in args[0]. The
objects in the linked list contain the fields str and
next. The field str contains a line from the data
file and next points to the next object in the list. After
reading the data file, a simple user interface launches, enabling the
user to query the search engine: Given a word, the engine informs the user whether or not the word was present in the input.
The joint part consists of solving the following 4
assignments.
These assignments will be presented in detail during the first meeting, on
October 19th.
- Compile and run the program SearchCmd.
- Modify SearchCmd so that if the user requests a search for a
particular word (say, "Hans"), the URLs of all pages where the word
"Hans" occurs are written to the screen. This has to be done
by changing the search method in SearchCmd. It is not done by changing the data structure.
- Modify the construction of the data structure in such a way that a linked
list of objects containing three fields is created. The three fields are str,
next and urls. The fields str and next
are like in SearchCmd, with the exception that there are only words
in the list and each word only occurs once. In case an object contains the
word "Hans" in the field str, the object's urls field will
be a pointer to a list of URLs of all pages that contain the word "Hans". A list
of URLs consists of objects with two fields; next and url.
The field next is a pointer to the next object in the list and the
field url is (a pointer to) a URL. After construction of this data structure, you have to program the
corresponding search procedure(s). The data structure created in
this assignment has smaller size than the structure
from SearchCmd and
is faster to search through.
- Modify the solution from assignment 3 so as to use a hashtable (see data structures) instead of a linked list of words.
You can create the data structure using chained hashing. This means
that each entry contains a reference to a linked list of URLs. The
use of a hashtable will have a dramatic effect on both the time it
takes to initialize the search engine and the time it takes to
perform individual searches.
In the solution of the above assignments it is not allowed to use any
packages other than java.io and java.lang.
As a consequence it is not allowed to use any of the classes from the
java.util package: One of the purposes of this project is to
learn how to program various data structures yourself, as opposed to just using
data structures programmed by others.
As you can see, you are allowed to use different methods on String
objects. It is also allowed to use the method hashCode and to allocate arrays. If you have any doubt whether some Java functionality is allowed for use or not, contact one of the teachers.
For the assignments after the joint part, there are no restrictions on the
Java functionality you may use. We however recommend that you do not spend too
much time looking for advanced and unnecessarily complicated features in the
vast collection of java libraries.
The advanced part is mandatory for OOP students. The task is to add
support for the boolean operators AND and OR to the search engine.
The resulting search engine should accept three kinds of input
strings:
- Ordinary words, as before. Example: "Hans".
- A word AND another word. Example: "Hans AND Grethe".
- A word OR another word. Example: "Hans OR Grethe".
For example, the input "Hans AND Grethe" should result in a list of
those pages where both words "Hans" and "Grethe" occur, while the input
"Hans OR Grethe" should result in a list of those pages where either
"Hans" or "Grethe" (or both) occurs.
You are not required to support more complex expressions
containing several occurrences of AND and OR. (If you are interested
in solving that problem, see one of the suggested extensions below.)
As a starting point you should use your solution to assignment 4 of
the joint part (i.e., the hashtable-based solution).
The advanced part
is deliberately more open-ended than the joint part. However, we
recommend that you proceed as follows:
- Modify the search method in SearchCmd such that it
recognizes (but does not yet respond to) the three kinds of
input strings listed above; you may want to use the java.util.StringTokenizer class. If the input is an ordinary word, a
normal search is performed. If the input is, e.g., "Hans AND
Grethe", the method simply prints "And: Hans Grethe" instead of
performing a search. If the input is, e.g., "Hans OR Grethe", the
method simply prints "Or: Hans Grethe" instead of performing a
search.
(This advice applies to everybody, not just new programmers. Failure
to divide tasks into sub-tasks can result in spectacularly
complicated debugging sessions where the error is finally traced to a
supposedly "easy" part.)
- Modify the search method such that it responds correctly to "AND"
input strings. For now, aim for the simplest and easiest solution
you can think of, without worrying about performance. Do the same
for "OR" input strings.
- (Optional.) Try to improve the design. The goal is to make
searches faster and/or decrease the memory usage. You may want to
look at the java.util.HashSet class or other classes from the
Java Collections Framework.
The following are suggestions for assignments to do after
the mandatory part(s).
The ordering of these suggestions
is random.
- Write a web crawler.
- Use the extended data file.
- Investigate how to easily support prefix-search. For example a search for "arm*" should find all pages with "arms",
"armoury", "armadillo", "armageddon", "armistice" etc.
This only requires an extra table of pointers to the words, where the table is ordered by words.
- Create a graphical user interface.
- Make the program more efficient in terms of usage of space (mostly for people studying Introduction to algorithms and data structures).
Idea: Instead of having pointers to URL-strings, you can put the various URL-strings in a table. The pointers can now be
replaced by a number corresponding to the entry in the URL-table where the URL can be found. Since you don't know how large the
URL-table has to be, you double it's size each time it runs full (this is described in CLRS). By using amortized analysis it can be
shown that this doesn't increase the running time significantly. Instead of using a list of URL's with each word, a table of URL's
represented as numbers can now be used. This also requires doubling techniques and amortized analysis. For words that occur in almost
all pages a third strategy can be used; namely a "bit vector". The idea is that a word occurs in the page if and only if the corresponding bit has a value 1.
- Construct your own hashCode method. See the reference to hashing under Data structures.
- Find other data other than homepages
to search through and modify your search engine to fit your situation.
For example, you could be searching for grep patterns in a text file.
- Design a tool that, given a web page x,
will search for web pages similar to x.
- Enable the user limit the search to URLs matching a pattern such as
*.dk or itu.dk/dkm/*.
- Combine two user-imposed constraints on searching, for example limited-domain search and search for a particular prefix.
- Support complex boolean searches, for instance
((((a AND b) OR c) AND f OR b) OR g) AND NOT h.
- If a search produces no result, it might be because the user performed an erroneous keystroke, so that, e.g., an "a" became an "s". In such cases you can choose to return pages which contain words almost matching the word searched for.
- Support free text searches. That is, give the opportunity to search for (fragments of) sentences. A strategy is
to implement a free-text search as an AND boolean search. This limits the search to a small amount of
pages which are then examined in closer detail. Another approach is to calculate hash values for sentences instead of
words. A third possibility is a more compact representation of the text. Instead of having the word repeated
a certain number of times, you remember at which positions in the homepage the word occurs. Find your own
favorite strategy.
- How can the data structure be updated dynamically? You can choose to collect information
on a lot of new homepages before updating. Another approach is to implement dynamic data structures
such as dynamic search trees.
The method chosen will be a compromise between
space, search time and the time for updating the data structure.
- Implement the joint part using vectors, collections etc. and compare
this (these) solution(s) to the solution from the joint part where these classes were not used.
The following are general requirements on the report:
-
The report must contain a foreword telling who has fulfilled the assignment,
what the assignment contains, where the assignment is made (ITU), during
what period, and who the teachers are. Furthermore, a short
statement on whether the project went as planned is required. The foreword
should not exceed one page.
-
The document "project-agreement" is used as the cover page for the assignment.
Furthermore there are the following requirements on the report that are
related to the three search engines implemented in the joint part:
- You have to describe how the different search engines work, which data structures
are used and how they are used. This description must be in a natural language and Java code
may only be used to support the description.
- You have to document your investigation of how effective the different search engines are by running them.
- You have to compare the different implementations to each other,
by using your benchmarking results from the preceding clause
and/or by applying algorithm analysis. People studying
algorithms are expected to use O-notation in this comparison.
-
You have to document correctness testing of the search engines. Both an internal test
(also called white box/structural test) and an external test (also called black box/functional test) must be performed and documented.
A note on the subject of testing is available.
- The code for your search engines and documentation of tests etc. must be handed in as appendices to the project report. In addition, your code must be handed in on a CD or USB stick.
What the report contains apart from the joint part, depends on the work done by the group and the chosen focus of this work.
There is a note on what a project
report might look like. This note is recommended for preparation for this project. In that note,
an estimate of 25 pages is made for a project report. For this concrete project our estimate is
less than 20 pages using a standard font (20 being a large report). Neither appendices nor figures are counted towards
the 20 pages. Experience shows that the following structure makes for a good
report:
- Cover page (the "project-agreement")
- Foreword
- Walk-through of SearchCmd 1
- Walk-through of SearchCmd 2
- Walk-through of SearchCmd 3
- Walk-through of SearchCmd 4
- Benchmarking results for the different search engines together with theoretical explanations of the results.
- Tests
- Extra assignments (including the advanced part if you carry it out)
The description of SearchCmd can be presented as follows:
- Describe the Input/Output-relation ("What does the program do?") without reference to code in such a way that it could be read by anyone.
- Describe the data structure without reference to the specific implementation.
- Describe ideas, invariants etc. for the code.
- Describe in general terms and without references to the code how it is structured.
- Describe details of the code if needed.
- Describe how much time and space the code uses.
- Compare to previously described search engines.
In order to complete the assignments in the joint part
you need (an understanding of) linked lists and hashing.
Linked lists are explained at the first
project day. If you don't know about linked lists
you can take a look at
this homepage.
You can also look at ListProgram.java
which contains a small example of the use of linked lists.
As supplementary material on hashing we recommend this homepage.
N.B.: Be aware that each group must consist of three or four persons.
To sign up for the project, your group must fill in a project
agreement in the project base. The process is described in the Study
Guide; see in particular here and
here.
You will need the following information:
- Project title (Danish): "Søgemaskineprojekt."
- Project title (English): "Search Engine Project."
- Language (report): English.
- What will be handed in: Written report. Source code
on CD or USB stick.
- Project type: "7.5 ECTS".
- Size in ECTS: "7.50".
- Project period: "October 18, 2009—December 16, 2009, 15.00".
- Examination language: English.
As for the method, you can write "Programming, testing,
and performance measuring using the Java programming language" or similar.
As for the problem formulation, you can use the following text
as a starting point (as well as the other descriptions on this page):
The goal of this project is to implement the core functionality of
a simple, text-based search engine and study its performance. See the
project cluster web
page: http://www.itu.dk/courses/JK01/E2009/
In the first part of the project, the following assignments must be solved:
- Compile and run the
program SearchCmd, supplied by the
supervisors.
- Modify SearchCmd so that if the user requests a search for a
particular word (say, "Hans"), the URLs of all pages where the word
"Hans" occurs are written to the screen. This has to be done
by changing the search method in SearchCmd. It is not done by changing the data structure.
- Modify the construction of the data structure in such a way that a linked
list of objects containing three fields is created. The three fields are str,
next and urls. The fields str and next
are like in SearchCmd, with the exception that there are only words
in the list and each word only occurs once. In case an object contains the
word "Hans" in the field str, the object's urls field will
be a pointer to a list of URLs of all pages that contain the word
"Hans". After constructing this data structure, program
the corresponding search procedure(s).
- Modify the solution from assignment 3 so as to use a hashtable
instead of a linked list of words.
In the second part of the project, the task is to <...>
When the projects are handed in an exam plan will be made ASAP.
Here is an example of how to compile and run a java program through the command line:
$ ls
Example.java
$ javac Example.java
$ ls
Example.java Example.class
$ java Example argument1 argument2 argument3
Program output
|
|