Overview
The approach of Multi-Dimensional Homomorphisms (MDH) is an algebraic formalism for systematically reasoning about decomposition and recomposition strategies of data-parallel computations (such as linear algebra routines and stencil computations) for the memory and core hierarchies of state-of-the-art parallel architectures (GPUs, multi-core CPUs, multi-device and multi-node systems, etc.).
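To illustrate the underlying algebraic idea, the following plain-Python sketch (not MDH code; all names are illustrative) shows why MatVec is a homomorphism in two dimensions: partial results computed on pieces of the input can be recombined into the full result, by concatenation in the i-dimension and point-wise addition in the k-dimension.

```python
# Illustrative sketch (not MDH's API): MatVec as a multi-dimensional homomorphism.
def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]

M = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
v = [1, 2, 3]

# Decompose the i-dimension: recombine partial results via concatenation (++)
w_i = matvec(M[:2], v) + matvec(M[2:], v)

# Decompose the k-dimension: recombine partial results point-wise via (+)
left  = matvec([row[:2] for row in M], v[:2])
right = matvec([row[2:] for row in M], v[2:])
w_k   = [a + b for a, b in zip(left, right)]

assert w_i == matvec(M, v) == w_k  # both strategies yield the same result
```

MDH systematically enumerates and optimizes such de- and re-composition strategies for the target hardware.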
The MDH approach (formally) introduces:
- High-Level Program Representation (Contribution 1) that enables the user to conveniently implement data-parallel computations, agnostic of hardware and optimization details;
- Low-Level Program Representation (Contribution 2) that expresses device- and data-optimized de- and re-composition strategies of computations;
- Lowering Process (Contribution 3) that fully automatically lowers a data-parallel computation expressed in the high-level program representation to an optimized instance in the low-level representation, based on concepts from automatic performance optimization (a.k.a. auto-tuning), using the Auto-Tuning Framework (ATF).
MDH’s low-level representation is designed such that code generation from it (e.g., in OpenMP for CPUs, CUDA for NVIDIA GPUs, or OpenCL for multiple kinds of architectures) becomes straightforward.
Our experiments report encouraging results on GPUs and CPUs for MDH compared to state-of-practice approaches, including NVIDIA cuBLAS/cuDNN and Intel oneMKL/oneDNN, which are hand-optimized libraries provided by vendors.
Ultimate MDH Goals
- Performance: achieve performance competitive with hand-optimized solutions
- Portability: target any kind of parallel architecture
- Productivity: free the user from hardware and optimization details
Getting Started
(Our implementation of MDH will be open-sourced on GitHub.)
Code Examples
From the following code examples, our MDH compiler fully automatically generates device- and data-optimized, executable program code, e.g., in OpenMP for CPUs, CUDA for NVIDIA GPUs, or OpenCL for multiple kinds of architectures.
MDH’s Python-Based User Interface

Matrix-Vector Multiplication (MatVec) expressed in MDH’s high-level program representation:
def matvec(T: ScalarType, I: int, K: int):
    @mdh( out( w = Buffer[T, [I]] ) ,
          inp( M = Buffer[T, [I, K]], v = Buffer[T, [K]] ) )
    def mdh_matvec():
        def mul(out, inp):
            out['w'] = inp['M'] * inp['v']
        def scalar_plus(res, lhs, rhs):
            res['w'] = lhs['w'] + rhs['w']
        return ( out_view[T]( w = [lambda i, k: (i)] ),
                 md_hom[I, K]( mul, ( CC, PW(scalar_plus) ) ),
                 inp_view[T, T]( M = [lambda i, k: (i, k)] ,
                                 v = [lambda i, k: (k)   ] ) )
The above-defined matvec function is used as follows:

# MatVec on 1024x1024-sized input matrix and 1024-sized vector (both containing fp32 values)
matvec__fp32__1024_1024 = matvec( fp32, 1024,1024 )

# ... (CUDA host code: create CUDA context, CUDA buffers for "M","v","w", etc.)

# Get MDH "CUDA Module" for MatVec (using ATF-tuned optimizations)
cuda__matvec__fp32__1024_1024 = matvec__fp32__1024_1024.get_module( CUDA(), pyATF( CUDARuntimeProfiler(), evaluations(1000) ) )

# MDH CUDA Module: compile & load CUDA code
a100_cuda__matvec__fp32__1024_1024 = cuda__matvec__fp32__1024_1024.compile( arch='compute_80' )

# MDH CUDA Module: run MatVec on M,v to obtain w
a100_cuda__matvec__fp32__1024_1024.run( w,M,v )

# MDH CUDA Module: destroy module
a100_cuda__matvec__fp32__1024_1024.destroy()

# ... (CUDA host code: destroying CUDA context, freeing CUDA buffers, etc.)

Jacobi-1D (Jacobi1D) expressed in MDH’s high-level program representation:
def jacobi1d(T: ScalarType, I: int):
    @mdh( out( y = Buffer[T, [I]] ) ,
          inp( x = Buffer[T, [I + 2]] ) )
    def mdh_jacobi1d():
        def f_j1d(out, inp):
            out['y'] = (inp['x', 1] + inp['x', 2] + inp['x', 3]) / 3.0
        return ( out_view[T]( y = [lambda i: (i)] ),
                 md_hom[I]( f_j1d, ( CC ) ),
                 inp_view[T]( x = [lambda i: (i + 0), lambda i: (i + 1), lambda i: (i + 2)] ) )
The above-defined jacobi1d function is used as follows:

# Jacobi1D on a 1024-sized input vector (containing fp32 values)
jacobi1d__fp32_1024 = jacobi1d( fp32, 1024 )

# ... (CUDA host code: create CUDA context, CUDA buffers for "x","y", etc.)

# Get MDH "CUDA Module" for Jacobi1D (using ATF-tuned optimizations)
cuda__jacobi1d__fp32_1024 = jacobi1d__fp32_1024.get_module( CUDA(), pyATF( CUDARuntimeProfiler(), evaluations(1000) ) )

# MDH CUDA Module: compile & load CUDA code
a100_cuda__jacobi1d__fp32_1024 = cuda__jacobi1d__fp32_1024.compile( arch='compute_80' )

# MDH CUDA Module: run Jacobi1D on x to obtain y
a100_cuda__jacobi1d__fp32_1024.run( y,x )

# MDH CUDA Module: destroy module
a100_cuda__jacobi1d__fp32_1024.destroy()

# ... (CUDA host code: destroying CUDA context, freeing CUDA buffers, etc.)
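For reference, the computation expressed above has the following plain sequential semantics (an illustrative Python sketch, independent of MDH): each output element is the average of a 3-point neighborhood of the input, so an input of size I+2 yields an output of size I.

```python
# Reference semantics of Jacobi-1D (plain Python sketch, not MDH code):
def jacobi1d_ref(x):
    # y[i] = ( x[i] + x[i+1] + x[i+2] ) / 3.0, for i in 0..len(x)-3
    return [(x[i] + x[i + 1] + x[i + 2]) / 3.0 for i in range(len(x) - 2)]

y = jacobi1d_ref([3.0, 6.0, 9.0, 12.0])  # input size 4 -> output size 2
```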
MDH’s MLIR-Based User Interface

func.func @main() {
  %M = memref.alloc() : memref<128x64xf32>
  %v = memref.alloc() : memref<64xf32>

  %w = mdh.compute "mdh_matvec"
  {
    inp_view = [ [ affine_map<( i,k ) -> ( i,k )> ],
                 [ affine_map<( i,k ) -> ( k )  > ] ],
    md_hom   = { scalar_func = @mul, combine_ops = [ "cc", ["pw",@add] ] },
    out_view = [ [ affine_map<( i,k ) -> ( i )> ] ]
  }
  {
    inp_types = [ f32, f32 ],
    mda_size  = [ 128,64 ],
    out_types = [ f32 ]
  }( %M,%v ) : ( memref<128x64xf32>, memref<64xf32> ) -> memref<128xf32>

  return
}
Here, functions @mul and @add are straightforward, user-defined functions computing scalar multiplication and scalar addition, respectively (both not shown for brevity). Functions cc and pw are pre-implemented combine operators computing concatenation (cc) and point-wise combination (pw), respectively.
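The two combine operators can be sketched in a few lines of Python (a hypothetical illustration of their meaning, not MDH's actual implementation): cc concatenates partial results along the decomposed dimension, while pw lifts a scalar operator to combine partial results point-wise.

```python
# Illustrative sketch (not MDH's implementation) of the combine operators:
def cc(lhs, rhs):
    # concatenation of partial results along the decomposed dimension
    return lhs + rhs

def pw(op):
    # lifts a scalar operator 'op' to a point-wise combine operator
    return lambda lhs, rhs: [op(l, r) for l, r in zip(lhs, rhs)]

add = lambda a, b: a + b  # stands in for the user-defined @add

assert cc([1, 2], [3, 4]) == [1, 2, 3, 4]
assert pw(add)([1, 2], [3, 4]) == [4, 6]
```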
func.func @main() {
  %x = memref.alloc() : memref<64xf32>

  %y = mdh.compute "mdh_jacobi1d"
  {
    inp_view = [ [ affine_map<( i ) -> ( i+0 )>,
                   affine_map<( i ) -> ( i+1 )>,
                   affine_map<( i ) -> ( i+2 )> ] ],
    md_hom   = { scalar_func = @jacobi, combine_ops = [ "cc" ] },
    out_view = [ [ affine_map<( i ) -> ( i )> ] ]
  }
  {
    inp_types = [ f32 ],
    mda_size  = [ 62 ],
    out_types = [ f32 ]
  }( %x ) : ( memref<64xf32> ) -> memref<62xf32>

  return
}
Here, function @jacobi is the Jacobi-specific scalar computation (not shown for brevity), and cc is a pre-implemented combine operator computing concatenation.
Automatic Parallelization & Optimization
Additionally, MDH also supports as input – as an alternative to DSL programs in MDH’s high-level programming interface (shown above) – straightforward, annotated sequential program code. For our MatVec example, the Python-based input code has the following form:
def matvec(T: ScalarType, I: int, K: int):
    @mdh( out( w = Buffer[T, [I]] ) ,
          inp( M = Buffer[T, [I, K]], v = Buffer[T, [K]] ) ,
          combine_ops = ( CC, PW(scalar_plus) ) )
    def mdh_matvec(w, M, v):
        for i in range(I):
            for k in range(K):
                w[i] = M[i, k] * v[k]
This program is completely equivalent to the DSL-based MDH program for MatVec shown above and is used in exactly the same way:
# MatVec on 1024x1024-sized input matrix and 1024-sized vector (both containing fp32 values)
matvec__fp32__1024_1024 = matvec( fp32, 1024,1024 )
# ... (CUDA host code: create CUDA context, CUDA buffers for "M","v", "w", etc.)
# Get MDH "CUDA Module" for MatVec (using ATF-tuned optimizations)
cuda__matvec__fp32__1024_1024 = matvec__fp32__1024_1024.get_module( CUDA(), pyATF( CUDARuntimeProfiler(), evaluations(1000) ) )
# MDH CUDA Module: compile & load CUDA code
a100_cuda__matvec__fp32__1024_1024 = cuda__matvec__fp32__1024_1024.compile( arch='compute_80' )
# MDH CUDA Module: run MatVec on M,v to obtain w
a100_cuda__matvec__fp32__1024_1024.run( w,M,v )
# MDH CUDA Module: destroy module
a100_cuda__matvec__fp32__1024_1024.destroy()
# ... (CUDA host code: destroying CUDA context, freeing CUDA buffers, etc.)
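The annotations above can be read as follows: the loop body computes one partial result per (i,k) point; partial results are recombined by concatenation (CC) in the i-dimension and point-wise via scalar_plus (PW) in the k-dimension. A plain-Python sketch of the resulting sequential semantics (an illustration under these assumptions, not MDH code):

```python
# Sequential semantics implied by the combine_ops annotation (plain Python sketch):
def matvec_seq(M, v, I, K):
    w = [0.0] * I
    for i in range(I):        # results for distinct i are concatenated (CC)
        for k in range(K):    # results for distinct k are added point-wise (PW(scalar_plus))
            w[i] += M[i][k] * v[k]
    return w

w = matvec_seq([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], 2, 2)  # -> [3.0, 7.0]
```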
Schedule-Based Optimization Process
MDH optionally allows incorporating expert knowledge into the optimization process via its scheduling language. By incorporating the user into the optimization process, we enable two major advantages over the standard MDH workflow:
- better optimization, as an auto-tuning system might not always make the same high-quality optimization decisions as a human expert;
- faster auto-tuning, as some (or even all) optimization decisions might be made by the expert user and thus need not be left to the costly auto-tuner.
(An example scheduling program will follow soon.)
Publications

A. Rasch
(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms
ACM Transactions on Programming Languages and Systems (TOPLAS 2024)
Paper Slides 
A. Rasch
Full Version: (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms
arXiv 2024
Paper Slides 
A. Rasch, R. Schulze, D. Shabalin, A. Elster, S. Gorlatch, M. Hall
(De/Re)-Compositions Expressed Systematically via MDH-Based Schedules
ACM SIGPLAN International Conference on Compiler Construction (CC 2023)
Paper Slides 
A. Rasch, R. Schulze, and S. Gorlatch
Generating Portable High-Performance Code via Multi-Dimensional Homomorphisms
International Conference on Parallel Architectures and Compilation Techniques (PACT 2019)
Paper Slides 
A. Rasch, R. Schulze, M. Gorus, J. Hiller, S. Bartholomäus, S. Gorlatch
High-Performance Probabilistic Record Linkage via Multi-Dimensional Homomorphisms
ACM/SIGAPP Symposium On Applied Computing (SAC 2018)
Paper Slides 
A. Rasch, S. Gorlatch
Multi-Dimensional Homomorphisms and Their Implementation in OpenCL
International Journal of Parallel Programming (IJPP 2018)
Paper
WIP/Short Papers & Talks

A. Rasch, R. Schulze, J. Hunloh, L. Hunloh
Code Generation & Optimization for Deep-Learning Computations via Multi-Dimensional Homomorphisms
Compilers for Machine Learning (C4ML 2024), (lightning talk)
Slides 1. Poster 2. Poster 
A. Rasch, R. Schulze, S. Gorlatch
Array Programming via Multi-Dimensional Homomorphisms
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2023), (WIP paper)
Paper Slides Talk 
R. Schulze, A. Rasch, S. Gorlatch
Code Generation & Optimization for Deep-Learning Computations on GPUs via Multi-Dimensional Homomorphisms
International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2021), Best Poster Finalist, (short paper)
Paper Poster Slides Talk 
A. Rasch, R. Schulze, S. Gorlatch
Using MLIR for Multi-Dimensional Homomorphisms
Google SIG MLIR Open Design Meeting 2020, (invited talk)
Slides Talk 
A. Rasch, S. Gorlatch
md_stencil: High-Performance Stencil Computations on CPU and GPU via Multi-Dimensional Homomorphisms
International Conference on Parallel Architectures and Compilation Techniques (PACT 2020), (SRC – Gold Winner)
Paper Poster Slides Talk 
A. Rasch, S. Gorlatch
md_poly: A Performance-Portable Polyhedral Compiler Based on Multi-Dimensional Homomorphisms
IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2020), (SRC – Gold Winner)
Paper Poster Slides 
A. Rasch, R. Schulze, S. Gorlatch
Performance, Portability, and Productivity for Data-Parallel Applications on Multi- and Many-Core Architectures
International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2019), (doctoral showcase)
Paper Poster Slides
Citations
Please use the following citations when referring to MDH’s:
- Formalism & Design

@article{10.1145/3665643,
  author    = {Rasch, Ari},
  title     = {(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms},
  year      = {2024},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  issn      = {0164-0925},
  url       = {https://doi.org/10.1145/3665643},
  doi       = {10.1145/3665643},
  journal   = {ACM Trans. Program. Lang. Syst.},
  month     = {may}
}
- Scheduling Language

@inproceedings{10.1145/3578360.3580269,
  author    = {Rasch, Ari and Schulze, Richard and Shabalin, Denys and Elster, Anne and Gorlatch, Sergei and Hall, Mary},
  title     = {(De/Re)-Compositions Expressed Systematically via MDH-Based Schedules},
  year      = {2023},
  isbn      = {9798400700880},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3578360.3580269},
  doi       = {10.1145/3578360.3580269},
  booktitle = {Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction},
  pages     = {61--72},
  numpages  = {12},
  keywords  = {GPU, scheduling languages, deep learning, CPU},
  location  = {Montr\'{e}al, QC, Canada},
  series    = {CC 2023}
}
- Automatic Parallelization and Optimization

@inproceedings{mdpoly,
  author    = {Rasch, Ari and Schulze, Richard and Gorlatch, Sergei},
  title     = {\texttt{md\_poly}: A Performance-Portable Polyhedral Compiler Based on Multi-Dimensional Homomorphisms},
  booktitle = {Proceedings of the International Workshop on Polyhedral Compilation Techniques (IMPACT 2020)},
  year      = {2020},
  pages     = {1--4}
}
Contact
Ari Rasch
Focus:  Formalism 
Affiliation:  University of Münster 
Email:  a.rasch@uni-muenster.de 
Website:  arirasch.net 
Richard Schulze
Focus:  Implementation 
Affiliation:  University of Münster 
Email:  r.schulze@uni-muenster.de 
Website:  richardschulze.net 
You can also find us on Discord.