Copenhagen Programming Language Seminar
Hadoop is a popular framework for processing large datasets. Many Hadoop jobs are very selective and operate on just a fraction of their input data, which is often unstructured (for instance, text files). In such scenarios it is impossible to apply out-of-the-box database optimizations. In this project at MSRC we have used static analysis techniques to examine the map phase of a job (more precisely, its executable bytecode) and automatically extract a filter that identifies the interesting "rows" and "columns" of the input data. Instead of sending all data from the storage to the compute cluster, we automatically identify and send only the subset of interest. Our automatically generated filters are purely an optimization: they soundly approximate the set of interesting data, they are side-effect free (whereas mappers need not be), and they can be killed or restarted on demand. Using our filters on example jobs, we have reduced network overheads by a factor of 5, and job completion times by a factor of 3 to 4 for certain jobs. In this talk I will emphasize the static analysis part and show how the domain of Hadoop map jobs is a great fit for a static analysis that is very simple to implement, cheap to run, and effective at improving job completion times.
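To make the idea of a sound filter concrete, here is a minimal sketch in Python. It is not the tool described in the talk: the mapper, the log format, and the names row_filter, column_filter, and run_job are all hypothetical, and the filter is written by hand rather than extracted from bytecode. It only illustrates the property the abstract relies on, namely that a filter may keep too much but must never drop a row or column the mapper would have used.

```python
# Hypothetical mapper: counts ERROR records from 2012 per user.
# Input lines are assumed to be tab-separated: date, level, user, message.
def mapper(line):
    fields = line.split("\t")
    if len(fields) < 3:
        return []  # malformed line, no output
    date, level, user = fields[0], fields[1], fields[2]
    if level == "ERROR" and date.startswith("2012"):
        return [(user, 1)]
    return []


# Hand-written stand-in for an extracted row filter. It over-approximates:
# every line the mapper emits output for contains "ERROR", but some lines
# that pass the filter produce no output. It has no side effects.
def row_filter(line):
    return "ERROR" in line


# Hand-written stand-in for an extracted column filter: the mapper reads
# only the first three fields, so the rest need not be shipped.
def column_filter(line):
    return "\t".join(line.split("\t")[:3])


def run_job(lines, use_filter):
    """Run the map phase, optionally applying the filters first."""
    out = []
    for line in lines:
        if use_filter:
            if not row_filter(line):
                continue  # row is never sent to the compute cluster
            line = column_filter(line)
        out.extend(mapper(line))
    return out


LINES = [
    "2012-03-01\tERROR\talice\tdisk full on node 7",
    "2012-03-01\tINFO\tbob\tjob started",
    "2011-12-31\tERROR\tcarol\ttimeout",
    "2012-03-02\tINFO\tdave\tsaw ERROR in upstream log",
    "malformed line",
]

if __name__ == "__main__":
    shipped = [column_filter(l) for l in LINES if row_filter(l)]
    print(len(shipped), "of", len(LINES), "rows shipped")
    print(run_job(LINES, use_filter=True) == run_job(LINES, use_filter=False))
```

In this sketch the filter passes two rows that the mapper later discards. That imprecision costs some bandwidth but not correctness, since the job output is the same with and without the filter.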
Scientific host: Fritz Henglein. Administrative host: Jette Møller.
All are welcome.
The Copenhagen Programming Language Seminar (COPLAS) is a collaboration between DIKU, DTU, ITU, and RUC.
COPLAS is part of the FIRST Research School.
To receive information about COPLAS talks by email, send a message to firstname.lastname@example.org with the word 'subscribe' as subject or in the body.
For more information about COPLAS, see http://www.coplas.org