# Introduction to Algorithms and Data Structures, Spring 2005, Episode 2

 Spring 2005 Schedule Students

## Lecture, February 11

Subject : Divide & Conquer technique and MergeSort; Heaps, Max-Heaps, and Heap-Sort; Priority Queues; Stacks, Queues, and Lists.

Text : Pages 196-199. Chapter 10.1. and 10.2, Chapter 2.3 (skip 2.3.2 you will get a note on this later), page 123-126, Chapter 6, and [B.3 and B.4].

Prerequisites : Understanding of O-notation, worst case and log n.

Comments : Ponder the FIFO and LIFO properties of stacks and queues, and the fact that these data structures can be implemented using arrays.
For Merge-Sort, it is important that you understand how this algorithm can be defined recursively.
Think about why the parent, left and right functions of heaps implemented with arrays on page 128 is correct. On page 132 skip the text about "master theorem". Instead express the complexity using the height h. Section 6.3 is not essential for understanding the following chapters. This section includes some math that will probably make more sense after the lecture. If you do not have a lot of math experience you should save 6.3 for later.

Animations of heap sort and merge sort:
Heap sort

Time : Friday, February 11, 9:00–12:00

Room : 2A12

Tutorial : 13:00–16:00, 3A14

## Assignments

• CLRS exercises 10.1-1 (H), 10.1-2, page 203
• Implement some of the procedures from Chapter 10.2 (H). See the following program example
• CLRS exercises 10.2-1 (H), 10.2-2 (H), page 208.
• CLRS exercises 6.1-1, 6.1-2, page 129.
• CLRS exercises 6.1-3 (H), 6.1-4, 6.1-5, 6.1-6 (H), page 130.
• CLRS exercises 6.2-1 (H), page 132.
• CLRS exercises 6.3-1 (H), page 135.
• CLRS exercises 6.4-1 (H), 6.4-2, page 136.
• CLRS exercises 6.5-1 (H), 6.5-2 (H) page 140
• CLRS exercises 6.5-3 (M), page 141
• CLRS exercises 6.5-6, 6.5-7 (M) [Hint: heapify is not enough to retain the heap property], 6.5-8, pages 142.
• CLRS excercise 6.5-5 (E). This exercise is not mandatory, but it will be corrected if you hand it in. It will not be a part of the exercise session.
• CLRS problem 6-2. Extra assignment if you like numbers on page 143.
The mandatory assignments should be handed in no later than February 23 at 13.00.

The mandatory assigments (M) should be placed in your teaching assistent box, or send to the TA by email.

## Programming assignments

### Assignment 2.1

This assignment deals with the performance of Heap-Sort. We should implement the algorithm first, then evaluate its performance experimentally.

Start with implementing Heapify and Build-Heap and test these procedures on data before implementing Heap-Sort. You can use this code as a starting point if you like. Warning: In Java the elements in a table `A` are indexed from `0` to `A.length-1`, which means that you have to compute `parent`, `left` and `right` slightly differently than in CLRS.

Use your program to measure the performance of sorting 1000, 2000, 5000, and 10000 integers. Compare the results with the results from the Insertion-Sort program from last weeks' assignment.

### Assignment 2.2

One of the major features of Merge-Sort is that it manipulates objects close to each other in sequences, without moving them back and forth. This feature makes Merge-Sort suitable for sorting data on external memory (for example hard disks), where access to closely located elements in file order is significantly faster. In this assignment we use Merge-Sort to sort data, which would not easily fit into memory of most of PCs.

File 1GB.txt.zip contains 10 000 000 random character strings, each 100 characters long (Warning the file has 611 MB). The file format is as follows. The first line contains a single number, stating how many objects follow (hence it says "10000000" in our file). Then each line contains a random sequence of 100 characters. Note that in Java you can use `String.compareTo` method to compare strings.

Implement a function `merge_sort` which sorts strings saved in such files in lexicographic order using Merge-Sort algorithm. One popular way to this is to read the input in chunks (for instance a ten thousand entries at a time) and sort it in memory using some standard sorting algorithm (use Insertion-Sort, or better some build in sorting algorithm of your library, like Quicksort). After having the chunk sorted save it to a temporary file (remembering the name). Do it for the entire input, which should result in data divided into a thousand of temporary files. At this point you can start to apply merging. Your function should merge pairs of temporary files storing them in new files (decreasing the number of files to 500). Continue this merging until only one file is left.

It is convenient to test the implementation on smaller sets, so you should parameterize the program with the chunk size. You can experiment with various chunk sizes, to increase efficiency. Also you can use the following smaller files: 1K.txt.zip (10 entries), 10K.txt.zip (100 entries) and 100K.txt.zip (1000 entries).

Rationale: This assignment is "realistic"—it has been based on one of the previous teachers' experience from a software development project in the electronic data interchange industry. In this project the chunks of data were messages coming in large numbers from the network of banking systems.

Note: In case you lack experience, Peter Sestoft's note Text Files in Java introduces handling of text files in Java. Alternatively see section 21.7 in Peter's book Java Precisely.

Note: You do not really need gigabytes of disk space to solve this exercise. It is sufficient (and more feasible) to work on smaller files. Although bear in mind that the implementation should be designed in the way that it should be able to handle huge files, given you had enough disk resources. At the same time it should operate using a relatively modest amount of volatile memory.

Note: The example files have been generated by the generate program (statically linked Linux binary, source in Standard ML). Type ```./generate 100 > file.txt``` to generate a file named file.txt containing 100 random entries. There is no need to waste time neither on running nor understanding the `generate` program. You can complete the assignment without it. It has been provided only for those who are in the mood for experimentation.

 last updated February 7, 2005 Anna Pagh, Andrzej Wąsowski