This challenge is about writing your own crawler, a program that downloads web pages from the internet. Basic knowledge of HTML is required for this exercise.
The file DownloadHtml.java contains an example of a Java program that downloads the homepage at a given URL and writes its content to the screen. You can write a crawler by extending this example to run through the downloaded pages, extracting words and references to other pages.
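As a rough illustration of "running through" a downloaded page, the sketch below pulls references and words out of an HTML string with regular expressions. The class name PageScanner and its methods are made up for this example and are not part of DownloadHtml.java. It assumes the page text is already in a String, and it only recognises quoted href values inside <a> tags, so treat it as a starting point rather than a complete solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from DownloadHtml.java.
class PageScanner {
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Matches any tag, so tags can be blanked out before splitting into words.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the href values of all <a> tags, in order of appearance.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the lower-cased words found in the text outside the tags.
    static List<String> extractWords(String html) {
        List<String> words = new ArrayList<>();
        String text = TAG.matcher(html).replaceAll(" ");
        for (String w : text.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<html><body>Here is something in <i>italics</i> and here is a "
            + "<a href=\"http://www.it-c.dk\">reference</a></body></html>";
        System.out.println(extractLinks(html));
        System.out.println(extractWords(html));
    }
}
```

Regular expressions are easy to get going with, but they do not understand nesting, comments or unquoted attributes, which is one reason to consider the tree-based approach as well.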
Another possibility is to first find the logical structure of the downloaded homepages and then extract references and words from that structure. The logical structure of a homepage can be represented by a tree. For instance, take a look at the following homepage:
<html>
<head>
<title>
Title of the homepage
</title>
</head>
<body>
Here is something in
<i>
italics
</i>
and here is a
<a href="http://www.it-c.dk">
reference
</a>
</body>
</html>
(The rendered content can be seen here.)
This homepage corresponds to the following tree:
                            <html>
                             /  \
           /----------------/    \----------------\
           |                                      |
        <head>                                 <body>
           |                                   / | | \
           |             /--------------------/  | |  \----------------------------\
           |             |                       | |                               |
           |             |                 /----/   \----\                         |
           |             |                 |             |                         |
           |         "Here is             <i>     "and here is a"    <a href="http://www.it-c.dk">
           |          something in"        |                                       |
           |                           "italics"                              "reference"
           |
        <title>
           |
"Title of the homepage"
There is a Java package, jtidy, which can take
a homepage and create the tree for the page. This process is called
parsing the homepage. Errors on the homepage can cause problems here, for
example if <i> is closed with </a>. In such cases
jtidy makes some guesses and small corrections in order to create a tree conforming
to the HTML standard.
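To make the idea of a page tree concrete, here is a small sketch of how such a tree could be modelled in plain Java. The HtmlNode class is invented for this illustration; jtidy has its own node classes, and this is not how jtidy is implemented. Each node is either a tag or a piece of text, and a recursive depth-first walk visits the whole page. The tag label simply holds the tag text as it appears in the drawing of the tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical tree type for illustration only; jtidy uses its own classes.
class HtmlNode {
    final String label;      // tag text such as "body", or the text content itself
    final boolean isText;    // true for text leaves, false for tags
    final List<HtmlNode> children = new ArrayList<>();

    private HtmlNode(String label, boolean isText) {
        this.label = label;
        this.isText = isText;
    }

    // Creates a tag node with the given children.
    static HtmlNode tag(String name, HtmlNode... kids) {
        HtmlNode n = new HtmlNode(name, false);
        n.children.addAll(Arrays.asList(kids));
        return n;
    }

    // Creates a text leaf.
    static HtmlNode text(String content) {
        return new HtmlNode(content, true);
    }

    // Renders the tree depth-first, one node per line, indented by depth.
    String render() {
        StringBuilder sb = new StringBuilder();
        render(sb, 0);
        return sb.toString();
    }

    private void render(StringBuilder sb, int depth) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(isText ? "\"" + label + "\"" : "<" + label + ">").append('\n');
        for (HtmlNode child : children) {
            child.render(sb, depth + 1);
        }
    }

    // Builds the tree of the example homepage by hand.
    static HtmlNode examplePage() {
        return tag("html",
            tag("head", tag("title", text("Title of the homepage"))),
            tag("body",
                text("Here is something in"),
                tag("i", text("italics")),
                text("and here is a"),
                tag("a href=\"http://www.it-c.dk\"", text("reference"))));
    }

    public static void main(String[] args) {
        System.out.print(examplePage().render());
    }
}
```

In a crawler the same kind of recursive walk would collect words from the text leaves and references from the a nodes, instead of printing them.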
You can browse the online documentation
for jtidy and get the file Tidy.jar,
which is necessary to compile and run programs that use jtidy.
Further information on the use of jar files can be found at the homepage
of the Search Engine Project.
You can get the example TidyExample.java
which prints what can be found directly under the body tag of a homepage
at a given URL. If it is run on the homepage above, the following is printed:
Text:Here is something in
Node:i
Text: and here is a
Node:a
Attr name:href
Attr value:http://www.it-c.dk
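A listing of this kind can be produced by walking the children of the body node. The sketch below is not TidyExample.java. To stay runnable without Tidy.jar, it uses the XML parser that ships with the JDK as a stand-in, which only works because the example page happens to be well-formed. The class and method names are made up for this illustration. As far as the jtidy documentation goes, Tidy.parseDOM also returns an org.w3c.dom.Document, so the walking method should carry over, but please verify that against the jtidy documentation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical example class; it mimics the output format shown above.
class BodyChildrenPrinter {
    // Describes everything found directly under the first <body> element.
    static String describeBodyChildren(Document doc) {
        StringBuilder out = new StringBuilder();
        Node body = doc.getElementsByTagName("body").item(0);
        if (body == null) {
            return "";
        }
        NodeList kids = body.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() == Node.TEXT_NODE) {
                out.append("Text:").append(kid.getNodeValue()).append('\n');
            } else if (kid.getNodeType() == Node.ELEMENT_NODE) {
                out.append("Node:").append(kid.getNodeName()).append('\n');
                NamedNodeMap attrs = kid.getAttributes();
                for (int j = 0; j < attrs.getLength(); j++) {
                    Node attr = attrs.item(j);
                    out.append("Attr name:").append(attr.getNodeName()).append('\n');
                    out.append("Attr value:").append(attr.getNodeValue()).append('\n');
                }
            }
        }
        return out.toString();
    }

    // Stand-in for jtidy: parses markup that is already well-formed XML.
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Title of the homepage</title></head>"
            + "<body>Here is something in <i>italics</i> and here is a "
            + "<a href=\"http://www.it-c.dk\">reference</a></body></html>";
        System.out.print(describeBodyChildren(parseWellFormed(page)));
    }
}
```

Real homepages are rarely well-formed XML, which is exactly where jtidy's error correction comes in. The stand-in parser here will simply reject such pages.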
There are several things to be aware of when writing a multithreaded crawler.