| Spring 2005 | Schedule | Students |
This assignment deals with the performance of Heap-Sort. We should implement the algorithm first, then evaluate its performance experimentally.
Start with implementing Heapify and Build-Heap and test these procedures on data before
implementing Heap-Sort. You can use this code as a starting point if you like.
Warning: In Java the elements in a table A are
indexed from 0 to A.length-1, which means
that you have to compute parent, left and
right slightly differently than in CLRS.
Use your program to measure the performance of sorting 1000, 2000, 5000, and 10000 integers. Compare the results with the results from the Insertion-Sort program from last weeks' assignment.
One of the major features of Merge-Sort is that it manipulates objects close to each other in sequences, without moving them back and forth. This feature makes Merge-Sort suitable for sorting data on external memory (for example hard disks), where access to closely located elements in file order is significantly faster. In this assignment we use Merge-Sort to sort data, which would not easily fit into memory of most of PCs.
File 1GB.txt.zip contains 10 000 000
random character strings, each 100 characters long (Warning the
file has 611 MB). The file format is as follows. The first line
contains a single number, stating how many objects follow (hence it
says "10000000" in our file). Then each line contains a random
sequence of 100 characters. Note that in Java you can use
String.compareTo method to compare strings.
Implement a function merge_sort which sorts strings
saved in such files in lexicographic order using Merge-Sort algorithm. One popular way to this is to
read the input in chunks (for instance a ten thousand entries at a
time) and sort it in memory using some standard sorting algorithm (use
Insertion-Sort, or better some build in
sorting algorithm of your library, like Quicksort). After having the chunk sorted save it
to a temporary file (remembering the name). Do it for the entire
input, which should result in data divided into a thousand of
temporary files. At this point you can start to apply merging. Your
function should merge pairs of temporary files storing them in new
files (decreasing the number of files to 500). Continue this merging
until only one file is left.
It is convenient to test the implementation on smaller sets, so you should parameterize the program with the chunk size. You can experiment with various chunk sizes, to increase efficiency. Also you can use the following smaller files: 1K.txt.zip (10 entries), 10K.txt.zip (100 entries) and 100K.txt.zip (1000 entries).
Rationale: This assignment is "realistic"—it has been based on one of the previous teachers' experience from a software development project in the electronic data interchange industry. In this project the chunks of data were messages coming in large numbers from the network of banking systems.
Note: In case you lack experience, Peter Sestoft's note Text Files in Java introduces handling of text files in Java. Alternatively see section 21.7 in Peter's book Java Precisely.
Note: You do not really need gigabytes of disk space to solve this exercise. It is sufficient (and more feasible) to work on smaller files. Although bear in mind that the implementation should be designed in the way that it should be able to handle huge files, given you had enough disk resources. At the same time it should operate using a relatively modest amount of volatile memory.
Note: The example files have been generated by the generate
program (statically linked Linux binary, source in Standard ML). Type ./generate
100 > file.txt to generate a file named file.txt containing
100 random entries. There is no need to waste time neither on running
nor understanding the generate program. You can complete
the assignment without it. It has been provided only for those who are
in the mood for experimentation.
|
|
|
|
last updated February 7, 2005 |