# Luddy School of Informatics, Computing, and Engineering

## Technical Report TR568:Matrix Factorization Using a Block-Recursive Structure and Block-Recursive Algorithms

Jeremy D. Frens
(Sep 2002), 169 pages pages
[PhD Dissertation]
Abstract:
The divide-and-conquer paradigm yields algorithms that parallelize easily, a very important consideration in high-performance computing. However, high-performance computing also relies on local reuse of data in a memory hierarchy of registers, caches, main memory, and swapping disks.

I have worked with a sequential representation of quadtree matrices that is a divide-and-conquer data structure. This representation uses an indexing scheme, Morton ordering, that automatically blocks the elements of the matrix and promotes memory locality in recursive algorithms over the quadtree. This work focuses on using this representation for two important matrix algorithms, matrix multiplication and QR factorization.

Techniques for generating independant and local code were discovered while implementing a recursive matrix-matrix multiplication over the sequential quadtree matrix. For good memory locality the basic multiplication routine was written as two dual, mutually recursive functions with the recursive calls in each version precisely ordered to reuse data already in cache.

To avoid unncessary work, minimal decorations annotate each submatrix of the matrix; these decorations are used to ignore zero blocks and to ellide the zero tests on blocks known to be dense. Finally, a strategy was developed for dispatching parallel processes in the multiplication algorithm.

These lessons were then used to develop an efficient QR factorization algorithm for quadtree matrices. It also uses recursive functions that localize data in blocks and that balance parallel processing.

The sequential quadtree matrix and multiplication functions are implemented in C because of its optimizing compilers. Since C optimizers favor iterative co de and are deficient for recursive code, some base optimizations were done by ha nd. These hand optimizations could and should be done by some future optimizing compiler.

Results are mostly favorable; while the quadtree-matrix algorithm s do not always perform at the same level as decades-old routines (i.e., BLAS an d LAPACK), which are fine-tuned for traditional storage, the quadtree algorithms are competitive and even expose some flaws in these old routines.

Available as: 