# Accelerating Symbolic Computations on NVidia Fermi Pavel Emeliyanenko

Max-Planck Institute for Informatics, Saarbrücken, Germany

asm@mpi-inf.mpg.de

Abstract: we present the first implementation of a modular resultant algorithm on GPUs [4,5]. With recent developments taking advantage of new NVidia Fermi GPU architecture and instruction set we have been able to achieve about 150x speedup over a CPU-based resultant algorithm from Maple 13.

#### **Motivation**

Resultants is a fundamental algebraic tool in the elimination theory. They have numerous applications, for instance in topological study of algebraic curves or computer graphics. Resultant of two **bivariate** polynomials **f** and **g** is the **determinant** of **Sylvester** matrix **S**:

$$f(x,y) = \sum_{i} f_{i}(x)y^{i} \quad g(x,y) = \sum_{i} g_{i}(x)y^{i}$$

$$res_{y}(f,g) = \det S_{f,g} \qquad S_{f,g} = \begin{cases} J_{4} & 0 & 0 & g_{3} & 0 & 0 & 0 \\ f_{3} & f_{4} & 0 & g_{2} & g_{3} & 0 & 0 \\ f_{2} & f_{3} & f_{4} & g_{1} & g_{2} & g_{3} & 0 \\ f_{1} & f_{2} & f_{3} & g_{0} & g_{1} & g_{2} & g_{3} \\ f_{0} & f_{1} & f_{2} & 0 & g_{0} & g_{1} & g_{2} \\ 0 & f_{0} & f_{1} & 0 & 0 & g_{0} & g_{1} \\ 0 & 0 & f_{0} & 0 & 0 & 0 & g_{0} \end{cases}$$

Computing resultant involves a **substantial amount** of symbolic operations which rapidly becomes a bottleneck for many exact geometric algorithms

### High-level structure of the algorithm

Following the ideas of classical "divide-conquer-combine" modular algorithm of Collins [1]:

given two bivariate polynomials with large integer coefficients use modular and evaluation homeomorphisms to reduce the problem to a simpler domain:

$$\mathbb{Z}[x,y] \to \mathbb{Z}_m[x,y] \to \mathbb{Z}_m[x,y]/(x-\alpha_m)$$

- compute univariate resultants over a prime field in parallel on the graphics hardware:  $z = res(f(y), g(y)): \mathbb{Z}_m[x, y]/(x - \alpha_m) \to \mathbb{Z}_m[x]/(x - \alpha_m)$

#### Introduction to Displacement structure

**Problem:** the amount of parallelism exhibited by the modular algorithm is **far too low** to satisfy the needs of massively-threaded architecture like that of GPU **Solution:** reduce the problem to computation with structured matrices because **matrix operations** typically map **very well** to the graphics hardware

#### Computing univariate resultants over a prime field

Computation of the resultant reduces to triangular factorization of Sylvester matrix S which is shift-structured [2]:  $S - ZSZ^T = GB^T$ 

 $Z \in \mathbb{Z}^{n \times n}$  is a down-shift matrix  $G, B \in \mathbb{Z}^{n \times 2}$  are generator matrices

The **Generalized Schur Algorithm** computes matrix factorization in  $O(n^2)$  time by operating solely on matrix generators:



|   | $\deg_{v} f: 19  \deg_{v} g: 1$ | 7        |       | 32×2 threads | $\deg_{y} f: 95  \deg_{y} g: 93$ |           |        | 96 threads |  |
|---|---------------------------------|----------|-------|--------------|----------------------------------|-----------|--------|------------|--|
|   | bits: 320 dense                 |          |       |              | bits: 16 sparse                  |           |        |            |  |
| ſ | $\deg_{x} f: 40  \deg_{x} g: 3$ | 0 56.7 s | 0.4 s | 215 × 1740   | $\deg_{x} f: 10  \deg_{x} g: 7$  | timed out | 6.35 s | 951 × 1604 |  |
|   | $\deg_{y} f: 31  \deg_{y} g: 2$ | 0        |       | 32×2 threads | $\deg_{y} f: 95  \deg_{y} g: 93$ | (> 15     |        | 96 threads |  |
|   | bits: 100 sparse                |          |       |              | bits: 120 dense                  | min)      |        |            |  |

 $deg_{x/y}$  – degrees in x/y of polynomials **f** and **g**; bits – coefficient bit-length; sparse/dense – varying density of polynomials; CUDA blocks executed: # of blocks run by 1<sup>st</sup> resultant kernel (**N** × **M**) and # of threads per block



Realization of 31-bit modular arithmetic on the GPU

## NVidia Fermi architectural features:

- native 32-bit integer multiplication support (instead of 24-bit multiplication on GT200)
- full-speed double-precision arithmetic (8x faster than that of GT200)
- modulo operation ("%") is costly: implement modular reduction in floating-point
- new set of video instructions: can do several arithmetic operations at a time (PTX assembly [3])

## Modular multiply-add: $(a + b \cdot c) \mod m$

## Vector updates: $(a \cdot b - c \cdot d) \mod m$

/ D2I TRUNC = (double)3^51 (fast mantissa truncation) '/ inv m = (double) 1 / mdouble f = (double)b \* (double)c \* inv m + D2I TRUNC; unsigned s = b \* c - double2loint(f) \* m;  $^{\prime}$  / equivalent to min(s, s + m) asm volatile("vadd.u32.u32.u32.min %0,%1,%2,%3;" : "=r"(s) : "r"(s), "r"(m), "r"(s)); s += a; // equivalent to min(s, s - m) asm volatile("vsub.u32.u32.u32.min %0,%1,%2,%3;" : "=r"(s) : "r"(s), "r"(m), "r"(s)); return s;

**Raw performance:** up to 154 GMad/s on the GTX480 graphics processor. **GMad/s** =  $10^9$  modular multiply-adds per second.

// inv m = (double) 1 / m**double** f1 = (**double**) a \* (**double**) b; double f2 = (double) c \* (double) d; **double** f = (f1 - f2) \* inv m; unsigned r = (unsigned) double2int rd(f); unsigned s = a \* b - c \* d - r \* m;// equivalent to min(s, s + m) asm volatile("vadd.u32.u32.u32.min %0,%1,%2,%3;" : "=r"(s) : "r"(s), "r"(m), "r"(s)); // equivalent to min(s, s - m) asm volatile("vsub.u32.u32.u32.min %0,%1,%2,%3;" : "=r"(s) : "r"(s), "r"(m), "r"(s)); return s;

For comparison: 2.5GHz Quad-Core Xeon E5420 can do about 1 GMad/s per core.

#### **References:**

[1] Collins G.E.: "The calculation of multivariate polynomial resultants", SYMSAC'71, 1971, 212-2 [2] Kailath T. and Sayed A.: "Displacement structure: theory and applications", SIAM review, 1995, 297–386 [3] PTX: Parallel Thread Execution. ISA Version 2.1. NVIDIA Corp., 2010 [4] Emeliyanenko P.: "Modular Resultant Algorithm for Graphics Processors", ICA3PP'10, 2010, 427-440 [5] Emeliyanenko P.: "A complete modular resultant algorithm targeted for realization on graphics hardware", PASCO'10, 2010, 35-43



