CnC: Running CnC applications on distributed memory

In principle, every clean CnC program is directly applicable to distributed memory systems. With only a few trivial changes, most CnC programs can be made distribution-ready, and the resulting binary runs on both shared and distributed memory. Most of the mechanics of data distribution are handled inside the runtime, so the programmer does not need to deal with the details. Once the changes are made, the program runs on distributed CnC as well as on "normal" CnC; which one is used is decided at runtime.

Inter-process communication

Conceptually, CnC allows data and computation distribution across any kind of network; currently CnC supports SOCKETS and MPI.

Linking for distCnC

Support for distributed memory is part of the "normal" CnC distribution, i.e. it comes with the necessary communication libraries (cnc_socket, cnc_mpi). The communication library is loaded on demand at runtime, so you do not need to link against extra libraries to create distribution-ready applications. Just link your binaries like a "traditional" CnC application (explained in the CnC User Guide, which can be found in the doc directory).

Note:: A distribution-ready CnC application binary has no dependencies on an MPI library. It can be run on shared memory or over SOCKETS even if no MPI is available on the system.

Even though it is not a separate package or module in the CnC kit, in the following we will refer to features that are specific to distributed memory as "distCnC".

Making your program distCnC-ready

As a distributed version of a CnC program needs to do things which are not required in a shared memory version, the extra code for distCnC is hidden from "normal" CnC headers. To include the features required for a distributed version you need to

 #include <cnc/dist_cnc.h>

instead of

 #include <cnc/cnc.h>

If you want to be able to create optimized binaries for shared memory and distributed memory from the same source, consider guarding distCnC specifics like this:

   #ifdef _DIST_
   # include <cnc/dist_cnc.h>
   #else
   # include <cnc/cnc.h>
   #endif

In "main", initialize an object CnC::dist_cnc_init< list-of-contexts > before anything else. Its template parameters should be all context types that you want to be distributed. Context types not listed here will stay local. You may mix local and distributed contexts, but in most cases only one context is needed/used anyway.

   #ifdef _DIST_
       CnC::dist_cnc_init< my_context_type_1 //, my_context_type_2, ...
                         > _dinit;
   #endif

Even though the communication between processes is entirely handled by the CnC runtime, C++ doesn't allow automatic marshaling/serialization of arbitrary data types. Hence, if and only if your items and/or tags are non-standard data types, the compiler will notify you about the need for serialization/marshaling capability. If you are using standard data types only, marshaling will be handled by CnC automatically.

Marshaling doesn't involve sending messages or the like; it only specifies how an object/variable is packed into and unpacked from a buffer. Marshaling of structs/classes without pointers or virtual functions can easily be enabled using

 CNC_BITWISE_SERIALIZABLE( type);

Other types need a "serialize" method or function. The CnC kit comes with a convenient interface for this which is similar to BOOST serialization. It is very simple to use and requires only one function/method for packing and unpacking. See Serialization for more details.
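To illustrate what "only one function/method for packing and unpacking" can look like, here is a minimal, self-contained sketch. It does not use the CnC serializer: the class sketch_serializer, its methods, and the type my_item are hypothetical names invented for this illustration. The sketch only shows the idea of a mode-aware `operator&` that copies bitwise for simple types and needs explicit handling for types that own memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical mini-serializer, NOT the CnC API: it only illustrates how one
// function can describe both packing and unpacking.
class sketch_serializer {
public:
    enum mode_t { PACK, UNPACK };

    explicit sketch_serializer(mode_t m) : m_mode(m), m_pos(0) {}

    // Switch to unpacking and start reading at the beginning of the buffer.
    void rewind_for_unpack() { m_mode = UNPACK; m_pos = 0; }

    // Bitwise case: valid only for types without pointers or virtual functions.
    template <typename T>
    typename std::enable_if<std::is_trivially_copyable<T>::value,
                            sketch_serializer&>::type
    operator&(T& value) {
        if (m_mode == PACK) {
            const char* p = reinterpret_cast<const char*>(&value);
            m_buf.insert(m_buf.end(), p, p + sizeof(T));
        } else {
            assert(m_pos + sizeof(T) <= m_buf.size());  // no read past the end
            std::memcpy(&value, m_buf.data() + m_pos, sizeof(T));
            m_pos += sizeof(T);
        }
        return *this;
    }

    // Non-bitwise case: a string owns heap memory, so the length is handled
    // first and the characters afterwards.
    sketch_serializer& operator&(std::string& s) {
        std::size_t n = s.size();
        *this & n;  // packs the length, or overwrites n when unpacking
        if (m_mode == PACK) {
            m_buf.insert(m_buf.end(), s.begin(), s.end());
        } else {
            assert(m_pos + n <= m_buf.size());
            s.assign(m_buf.data() + m_pos, n);
            m_pos += n;
        }
        return *this;
    }

private:
    mode_t m_mode;
    std::vector<char> m_buf;
    std::size_t m_pos;
};

// Hypothetical item type that is not bitwise-serializable (it holds a string).
struct my_item {
    int id;
    double weight;
    std::string label;

    // One method serves both directions; the serializer's mode decides which.
    void serialize(sketch_serializer& ser) { ser & id & weight & label; }
};
```

Calling serialize on a serializer in PACK mode fills the buffer; after rewind_for_unpack, calling the same method on a second object restores the values. The actual CnC interface is described in Serialization.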

This is it! Your CnC program will now run on distributed memory!

Attention:: Global variables are evil and must not be used within the execution scope of steps. Read Using global read-only data with distCnC about how CnC supports global read-only data. Pointers are effectively global variables as well and hence need special treatment in distCnC (see Serialization).

Note:: Even if your program runs on distributed memory, that does not necessarily imply that the trivial extension above will make it run fast. Please consult Tuning for distributed memory for the tuning options for distributed memory.

The above describes the default "single-program" approach for distribution. Please refer to CnC::dist_cnc_init for more advanced modes which allow SPMD-style interaction as well as distributing parts of the CnC program over groups of processes.

Running distCnC

The communication infrastructure used by distCnC is chosen at runtime. By default, the CnC runtime will run your application in shared memory mode. When starting up, the runtime evaluates the environment variable "DIST_CNC". Currently it accepts the following values:

SHMEM : shared memory (default)
SOCKETS : communication through TCP sockets
MPI : using Intel(R) MPI
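The selection by environment variable can be pictured with the following small sketch. It is not taken from the CnC runtime; the enum comm_mode and both functions are hypothetical names. How the runtime treats unrecognized values is not stated here, so the fallback to shared memory in the sketch is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical names: this mirrors the documented values, not CnC internals.
enum class comm_mode { SHMEM, SOCKETS, MPI };

// Map a DIST_CNC value to a mode. A null pointer means the variable is unset.
// Assumption: unrecognized values fall back to shared memory.
inline comm_mode parse_dist_cnc(const char* value) {
    if (value == nullptr) return comm_mode::SHMEM;  // documented default
    const std::string v(value);
    if (v == "SOCKETS") return comm_mode::SOCKETS;
    if (v == "MPI") return comm_mode::MPI;
    return comm_mode::SHMEM;
}

// Read the variable from the process environment.
inline comm_mode mode_from_environment() {
    return parse_dist_cnc(std::getenv("DIST_CNC"));
}
```

Because the decision is made when the process starts, the same binary can be launched in any of the three modes without rebuilding.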

Please see Using Intel(R) Trace Analyzer and Collector with CnC for how to profile distributed programs.

Using SOCKETS

On application start-up, when DIST_CNC=SOCKETS, CnC checks the environment variable "CNC_SOCKET_HOST". If it is set to a number, it will print a contact string and wait for the given number of clients to connect. Usually this means that clients need to be started "manually" as follows: set DIST_CNC=SOCKETS and "CNC_SOCKET_CLIENT" to the given contact string and launch the same executable on the desired machine.

If "CNC_SOCKET_HOST" is not a number, it is interpreted as the name of a script. CnC executes the script twice. The first invocation passes "-n" and expects the script to return the number of clients it will start. The second invocation is expected to launch the client processes.
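The two interpretations of "CNC_SOCKET_HOST" can be summarized in a short sketch. The struct socket_host_setting and the function classify_socket_host are hypothetical and not part of CnC. The sketch assumes that "a number" means a non-empty string of decimal digits; the runtime's actual check may differ.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical result type; not part of the CnC API.
struct socket_host_setting {
    bool is_client_count;  // true: the value is a number of clients to wait for
    int client_count;      // meaningful only if is_client_count is true
    std::string script;    // meaningful only if is_client_count is false
};

// Assumption: "a number" means a non-empty string of decimal digits.
inline socket_host_setting classify_socket_host(const std::string& value) {
    assert(!value.empty());  // callers pass only a set, non-empty variable
    bool all_digits = true;
    for (unsigned char c : value) {
        if (!std::isdigit(c)) { all_digits = false; break; }
    }
    socket_host_setting s;
    s.is_client_count = all_digits;
    s.client_count = all_digits ? std::stoi(value) : 0;
    s.script = all_digits ? std::string() : value;
    return s;
}
```

With a value of "3" the host would wait for three manually started clients; with a value such as "misc/start.sh" it would run that script as described above.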

There is a sample script "misc/start.sh" which you can use. Usually all you need to do is set the number of clients and replace "localhost" with the names of the machines you want the application clients to be started on. It requires password-less login via ssh. The script also gives some details of the start-up procedure. On Windows, the script "start.bat" does the same, except that it starts the clients on the same machine without ssh or the like. Adjust the script to use your preferred remote login mechanism.

MPI

CnC comes with a communication layer based on MPI. You need the Intel(R) MPI runtime to use it. You can download a free version of the MPI runtime from http://software.intel.com/en-us/articles/intel-mpi-library/ (under "Resources"). A distCnC application is launched like any other MPI application with mpirun or mpiexec, but DIST_CNC must be set to MPI:

env DIST_CNC=MPI mpiexec -n 4 my_cnc_program

Alternatively, just run the application as usual (with DIST_CNC=MPI) and control the number (n) of additionally spawned processes with CNC_MPI_SPAWN=n. If host and client applications need to be different, set CNC_MPI_EXECUTABLE to the client-program name. Here's an example:

env DIST_CNC=MPI env CNC_MPI_SPAWN=3 env CNC_MPI_EXECUTABLE=cnc_client cnc_host

It starts your host executable "cnc_host" and then spawns 3 additional processes which all execute the client executable "cnc_client".

Intel Xeon Phi(TM) (MIC)

For CnC, a MIC process is just another process on which work can be computed. So all you need to do is

Build your application for MIC (see http://software.intel.com/en-us/articles/intel-concurrent-collections-getting-started)
Start a process with the MIC executable on each MIC card just like on a CPU. Communication and start-up work the same way as on intel64 (see MPI and Using SOCKETS).

Note:: Of course, the normal mechanics for MIC need to be considered (like getting applications and dependent libraries to the MIC first). You'll find documentation about this on IDZ. We recommend starting only 2 threads per MIC-core, e.g. if your card has 60 cores, set CNC_NUM_THREADS=120. To start different binaries with one mpirun/mpiexec command you can use a syntax like this:
mpirun -genv DIST_CNC=MPI -n 2 -host xeon xeonbinary : -n 1 -host mic0 -env CNC_NUM_THREADS=120 micbinary

Default Distribution

Step instances are distributed across clients and the host. By default, they are distributed in a round-robin fashion. Note that every process can put tags (and so prescribe new step instances). The round-robin distribution decision is made locally on each process (not globally).

If the same tag is put multiple times, the default scheduling might execute the multiply prescribed steps on different processes and the preserveTags attribute of tag_collections will then not have the desired effect.
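The effect of a locally decided round-robin can be seen in a toy model. The class local_round_robin below is hypothetical and is not the CnC scheduler. It assumes that every process keeps a private counter starting at rank 0 and ignores the tag value when choosing a target, which is enough to show why the same tag can land on different processes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the default placement, NOT CnC code: each process
// keeps a private counter and cycles through all process ranks.
class local_round_robin {
public:
    explicit local_round_robin(std::size_t num_processes)
        : m_n(num_processes), m_next(0) {
        assert(num_processes > 0);
    }

    // Rank that receives the next step instance prescribed on this process.
    // The tag is deliberately ignored: only the local put order matters.
    std::size_t place(int /*tag*/) {
        const std::size_t target = m_next;
        m_next = (m_next + 1) % m_n;
        return target;
    }

private:
    std::size_t m_n;     // total number of processes (host plus clients)
    std::size_t m_next;  // next rank in this process's private cycle
};
```

In this model, with three processes, putting tag 42, then tag 7, then tag 42 again on one process yields ranks 0, 1 and 2, so the two instances for tag 42 end up on different processes. A tuned distribution avoids this by making placement depend on the tag (see Tuning for distributed memory).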

The default scheduling is intended primarily as a development aid: it makes your CnC application distribution-ready with little effort. In some cases it might lead to good performance; in other cases a sensible distribution is needed to achieve good performance. See Tuning for distributed memory.

Next: Tuning for distributed memory