Simple look at MPI

Hi all, it's been a while since my last post. I am following a course on high performance computing at university, and one module of the course was about MPI (Message Passing Interface). MPI is used to develop parallel programs. This post will not cover MPI in depth; it just gives a brief introduction. However, I'll post a link at the end so that you can learn more about MPI if you are interested.

Currently MPI runs on virtually any hardware platform. When developing an application with MPI, the programmer is responsible for correctly identifying the parallelism.

There are various implementations of MPI, e.g. mpich and LAM/MPI. Different implementations have different installation steps.

Installation guide for mpich.

Installation guide for LAM/MPI.

After the installation, to run an MPI program you need a machines file. This is just a text file with the IP addresses of the machines you will be using.

Ex: (file name: machines)

192.168.10.10
192.168.10.11
192.168.10.12

You can use the default cc compiler to compile standard C source files. However, you have to use the mpicc compiler to compile MPI source files, and you need mpirun to execute the resulting MPI executable.

Compile:
mpicc -o program program.c

Execute:
mpirun -np x -machinefile machines program

-np x: number of processes, where x is an integer (e.g. 4).
-machinefile machines: path to the machines file.
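For example, to run the program as 4 processes across the machines listed in the file above:

mpirun -np 4 -machinefile machines program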

All MPI functions follow the naming format below.

MPI_X
Ex:
MPI_Init
MPI_Comm_rank
MPI_Comm_size
MPI_Finalize
MPI_Send
MPI_Recv

You can find more information about the above function calls in the detailed guide linked at the bottom.

Let's look at a simple MPI program.
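A minimal version (a sketch, not the exact code from the course materials) might look like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI environment  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut the environment down  */
    return 0;
}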

Running it with mpirun prints one line per process, showing that process's rank and the total number of processes. The next step is to modify it to also display the host name, so you can see which machine each process ran on.
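One way to add the host name (a sketch using MPI_Get_processor_name; the buffer name is just illustrative) is to change the printing part to:

char hostname[MPI_MAX_PROCESSOR_NAME];
int len;

MPI_Get_processor_name(hostname, &len);
printf("Hello from process %d of %d on %s\n", rank, size, hostname);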



Now let's look at matrix multiplication, first with a sequential approach and then with an MPI approach.

Code for the sequential approach (mm.c); its core is sketched below.
Code for the MPI approach.
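At its heart, the sequential version is just the standard triple loop. A minimal sketch (the linked mm.c is the complete program, including timing):

#include <stdio.h>

#define N 4   /* matrix size; change to 1000 for the timing test below */

int main(void)
{
    static double a[N][N], b[N][N], c[N][N];   /* static so large N doesn't overflow the stack */
    int i, j, k;

    /* fill a and b with some test values */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i * j;
        }

    /* c = a * b with the standard triple loop */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

    printf("c[%d][%d] = %f\n", N - 1, N - 1, c[N - 1][N - 1]);
    return 0;
}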

A little explanation of the MPI approach:

The plan is to split the multiplication into several parts and distribute them among different worker nodes, so they can compute separately, and finally merge the results.

To distribute the multiplication we can divide the rows of the first matrix into the required number of pieces. For example, a 4x4 first matrix can be divided into two parts: rows 0-1 go to one worker and rows 2-3 to another.


With these two sets and the entire second matrix we can calculate the top and bottom halves of the final output. (A little thought will make the concept clear.)

The above code reads the number of tasks and works out the number of workers. (The master does no work other than distributing and merging the arrays.)
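That step might look roughly like this (a sketch; the variable names are illustrative, not necessarily the ones in the linked source):

MPI_Comm_rank(MPI_COMM_WORLD, &taskid);   /* 0 is the master, the rest are workers */
MPI_Comm_size(MPI_COMM_WORLD, &numtasks); /* total number of MPI processes         */
numworkers = numtasks - 1;                /* the master does no multiplication     */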

C stores arrays in row-major order in memory, so the rows of a matrix are contiguous. To send a block of rows to one of the worker machines you can use the following call.

MPI_Send(&a[offset][0], rows*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);

Here, rows*N elements of array "a", starting at position [offset][0], are transferred to the worker machine.

MPI_Send(&b, N*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);

The line above sends all N*N elements of array "b" to the worker machine.
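On the worker side, the matching receives, the partial multiplication, and the send of the results back to the master might look roughly like this (a sketch; the variable names and the reply tag 2 are illustrative). The master then collects each block with matching MPI_Recv calls and copies it into the final matrix at the right offset.

MPI_Status status;

MPI_Recv(&offset, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
MPI_Recv(&a, rows*N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);   /* only "rows" rows of a */
MPI_Recv(&b, N*N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);      /* the whole of b        */

/* compute the assigned rows of c */
for (i = 0; i < rows; i++)
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

/* send the computed block back to the master */
MPI_Send(&offset, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
MPI_Send(&c, rows*N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);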

I think the above explanation is enough to understand the entire code. If you have any questions, please do comment.

The sources above are written for a 4x4 matrix. I tested both with a 1000x1000 matrix, and the execution times are as follows.

Note: the MPI approach above uses only the worker machines for the computation. I modified the code a little so that the master node also participates in the computation. I also removed the lines that print the matrices, so only the run time is printed on the screen.



Using 4 machines for the multiplication reduced the run time by approximately 20 seconds. Impressive. :)

With compiler optimizations you might be able to reduce the run time even more. Give it a try.
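For example, something like this (the MPI source file name here is just illustrative):

cc -O3 -o mm mm.c
mpicc -O3 -o mm_mpi mm_mpi.c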

Most of the code was taken from the sample materials provided with the course, so the credit should go to the original authors of those sources.

Detailed guide on MPI.

Update: I compiled the sequential code with different compiler optimization levels and was able to reduce the run time to nearly 10 seconds.
