CnC
Tuning for distributed memory

The CnC tuning interface provides convenient ways to control the distribution of work and data across the address spaces. The tuning interface is separate from the actual step-code and its declarative nature allows flexible and productive experiments with different distribution strategies.

Distributing the work

Let's first look at the distribution of work/steps. You can specify the distribution of work (i.e. steps) across the network by providing a tuner to a step-collection (the second template argument to CnC::step_collection, see The Tuners). Similar to other tuning features, the tuner defines the distribution plan based on the control-tags and item-tags. For a given step-instance (identified by its control-tag) the tuner defines the placement of that instance in the communication network. This mechanism allows a declarative definition of the distribution and keeps it separate from the actual program code.

The method for distributing steps is called "compute_on". It takes the tag of the step and the context as arguments and has to return the process number to run the step on. The numbering of processes is similar to ranks in MPI. Running on "N" processes, the host process is "0" and the last client "N-1".

   struct my_tuner : public CnC::step_tuner<>
   {
       int compute_on( const tag_type & tag, context_type & ) const { return tag % numProcs(); }
   };

The shown tuner is derived from CnC::step_tuner. To allow a flexible and generic definition of the distribution, CnC::step_tuner provides information specific to distributed memory: CnC::tuner_base::numProcs() and CnC::tuner_base::myPid(). Both return values specific to the current run of your application. Using them allows defining a distribution plan which adapts to the actual runtime configuration.

If you wonder how the necessary data gets distributed - this will be covered soon. Let's first look at the computation side a bit more closely; if you can't wait, see Distributing the data.

The tuner given above simply distributes the tags in a round-robin fashion by applying the modulo operator to the tag. Here's an example of how a given set of tags would be mapped to 4 processes (i.e. numProcs()==4):

1  -> 1
3  -> 3
4  -> 0
5  -> 1
10 -> 2
20 -> 0
31 -> 3
34 -> 2

An example of such a simple tuner is samples/blackscholes/blackscholes/blackscholes.h.

Now let's do something a little more interesting. Let's assume our tag is a pair of x and y coordinates. To distribute the work per row, we could simply do something like

   struct my_tuner : public CnC::step_tuner<>
   {
       int compute_on( const tag_type & tag, context_type & ) const { return tag.y % numProcs(); }
   };

As you see, the tuner entirely ignores the x-part of the tag. This means that all entries on a given row (identified by tag.y) get executed on the same process. Similarly, if you want to distribute the work per column instead, you simply change it to

   struct my_tuner : public CnC::step_tuner<>
   {
       int compute_on( const tag_type & tag, context_type & ) const { return tag.x % numProcs(); }
   };

As we'll see later, you can of course also switch conditionally between row-wise, column-wise (or any other) distribution within compute_on.
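As a standalone sketch of such a conditional distribution (the pair_tag type and the dist_type switch are made up for illustration, and numProcs is passed explicitly here, whereas a real tuner would derive from CnC::step_tuner<> and call its inherited numProcs()):

```cpp
#include <cassert>

// Hypothetical 2D tag, mirroring the x/y pair used above.
struct pair_tag { int x; int y; };

// Illustrative switch between the two strategies.
enum dist_type { BY_ROW, BY_COL };

// Sketch of a compute_on body that picks the distribution at runtime.
int compute_on( const pair_tag & tag, dist_type dist, int numProcs )
{
    // The decision depends only on the tag and the configuration,
    // so every process computes the same mapping.
    return ( dist == BY_ROW ? tag.y : tag.x ) % numProcs;
}
```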

To avoid the afore-mentioned problem of becoming globally inconsistent, you should make sure that the return value is independent of the process it is executed on.

CnC provides special values to make working with compute_on more convenient, more generic and more effective: CnC::COMPUTE_ON_LOCAL, CnC::COMPUTE_ON_ROUND_ROBIN, CnC::COMPUTE_ON_ALL, CnC::COMPUTE_ON_ALL_OTHERS.
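For instance, a hypothetical tuner might run lightweight steps locally (on the process where their controlling tag was put) and distribute the rest round-robin. The sketch below is self-contained, so it substitutes its own constant for CnC::COMPUTE_ON_LOCAL; the threshold of 16 is purely illustrative:

```cpp
#include <cassert>

// Stand-in for CnC::COMPUTE_ON_LOCAL, only so this sketch compiles on
// its own; real code would use the constant from the CnC headers.
const int COMPUTE_ON_LOCAL = -2;

// Sketch: run "small" steps on the process where their tag was put
// (no communication needed) and distribute the rest round-robin.
int compute_on( int tag, int numProcs )
{
    if( tag < 16 ) return COMPUTE_ON_LOCAL;  // illustrative threshold
    return tag % numProcs;
}
```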

Distributing the data

By default, the CnC runtime will deliver data items automatically to where they are needed. In its current form, the C++ API does not express the dependencies between instances of steps and/or items. Hence, without additional information, the runtime does not know which step-instances produce and consume which item-instances. Even when the step-distribution is known, automatic distribution of data requires global communication, which constitutes a considerable bottleneck. The CnC tuner interface provides two ways to reduce this overhead.

The ideal, most flexible and most efficient approach is to map items to their consumers. It converts the default pull-model into a push-model: whenever an item is produced, it is sent only to those processes which actually need it, without any other communication/synchronization. If you can determine which steps are going to consume a given item, you can use the above compute_on to map the consuming steps to their address spaces. This allows changing the distribution at a single place (compute_on) while the data distribution is automatically optimized to the minimum needed data transfer.

The runtime evaluates the tuner provided to the item-collection when an item is put. If its method consumed_on (from CnC::item_tuner) returns anything other than CnC::CONSUMER_UNKNOWN it will send the item to the returned process id and avoid all the overhead of requesting the item when consumed.

   struct my_tuner : public CnC::item_tuner< tag_type, item_type >
   {
       int consumed_on( const tag_type & tag ) const
       {
           // sketch: consumer_step stands for the tag of the step which
           // consumes this item; the step-tuner's compute_on maps that
           // step to the process it will run on
           return my_step_tuner::compute_on( consumer_step );
       }
   };

As more than one process might consume the item, you can also return a vector of ids (instead of a single id) and the runtime will send the item to all given processes.

   struct my_tuner : public CnC::item_tuner< tag_type, item_type >
   {
       std::vector< int > consumed_on( const tag_type & tag ) const
       {
           std::vector< int > consumers;
           // pseudocode: iterate over all steps which consume the item
           // identified by tag
           foreach( consumer_step of tag ) {
               consumers.push_back( my_step_tuner::compute_on( consumer_step ) );
           }
           return consumers;
       }
   };

Like for compute_on, CnC provides special values to facilitate and generalize the use of consumed_on: CnC::CONSUMER_UNKNOWN, CnC::CONSUMER_LOCAL, CnC::CONSUMER_ALL and CnC::CONSUMER_ALL_OTHERS.

Note that consumed_on can return CnC::CONSUMER_UNKNOWN for some item-instances, and process rank(s) for others.
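A self-contained sketch of that mixed pattern (with a stand-in value for CnC::CONSUMER_UNKNOWN and an assumed, application-specific rule that only even tags have a known consumer):

```cpp
#include <cassert>

// Stand-in for CnC::CONSUMER_UNKNOWN, only to keep the sketch
// self-contained; real code would use the CnC constant.
const int CONSUMER_UNKNOWN = -1;

// Sketch of a consumed_on body: push items with a known consumer
// directly, fall back to the runtime's pull model for the rest.
int consumed_on( int tag, int numProcs )
{
    if( tag % 2 == 0 ) return tag % numProcs; // consumer known: push
    return CONSUMER_UNKNOWN;                  // unknown: runtime pulls
}
```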

Sometimes the program semantics make it easier to think about the producer of an item. CnC provides a mechanism which keeps the pull-model but allows declaring the owner/producer of the item. If the producer of an item is specified, the CnC-runtime can significantly reduce the communication overhead because it no longer requires global communication to find the owner of the item. For this, simply define the produced_on method in your item-tuner (derived from CnC::item_tuner) and return the owning/producing process.

   struct my_tuner : public CnC::item_tuner< tag_type, item_type >
   {
       int produced_on( const tag_type & tag ) const
       {
           // sketch: if the producer is known, map the producing step to
           // its process via the step-tuner; otherwise guess round-robin
           return producer_known ? my_step_tuner::compute_on( tag ) : tag % numProcs();
       }
   };

Like for consumed_on, CnC provides special values CnC::PRODUCER_UNKNOWN and CnC::PRODUCER_LOCAL to facilitate and generalize the use of produced_on.

The push-model (consumed_on) smoothly cooperates with the pull-model (produced_on) as long as they don't conflict.

Keeping data and work distribution in sync

For a more productive development, you might consider implementing consumed_on by thinking about which other steps (not processes) consume the item. With that knowledge you can simply use the appropriate compute_on function to determine the consuming process. The great benefit is that you can then change the compute distribution (e.g. change compute_on) and the data distribution will automatically follow in an optimal way; data and work distribution will always be in sync. This allows experimenting with different distribution plans with much less trouble and lets you define different strategies at a single place.

A simple example code which lets you select different strategies at runtime can be found in samples/blackscholes/blackscholes/blackscholes.h; adding a new strategy only requires extending the compute_on function. A more complex example is samples/cholesky/cholesky/cholesky.h.
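The pattern can be sketched standalone as follows, assuming (purely for illustration) a row-wise compute_on and that the item at (x,y) is consumed by the steps at (x,y) and (x,y+1):

```cpp
#include <cassert>
#include <vector>

// Hypothetical 2D tag, as in the coordinate examples above.
struct pair_tag { int x; int y; };

// Row-wise work distribution, as in the step-tuner shown earlier;
// numProcs is a parameter here instead of the inherited method.
int compute_on( const pair_tag & tag, int numProcs )
{
    return tag.y % numProcs;
}

// Sketch of consumed_on: reason about consuming *steps* and delegate
// the process decision to compute_on. Changing compute_on (e.g. to a
// column-wise scheme) automatically keeps the data distribution in sync.
std::vector< int > consumed_on( const pair_tag & tag, int numProcs )
{
    std::vector< int > consumers;
    consumers.push_back( compute_on( pair_tag{ tag.x, tag.y     }, numProcs ) );
    consumers.push_back( compute_on( pair_tag{ tag.x, tag.y + 1 }, numProcs ) );
    return consumers;
}
```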

Using global read-only data with distCnC

Many algorithms require global data that is initialized once and stays read-only during computation (dynamic single assignment, DSA). In principle this is aligned with the CnC methodology as long as the initialization is done from the environment. The CnC API allows global DSA data through the context, e.g. you can store global data in the context, initialize it there and then use it in a read-only fashion within your step codes.

The internal mechanism works as follows: on remote processes the user context is default constructed and then de-serialized/un-marshaled. On the host, construction and serialization/marshaling is done in a lazy manner, i.e. not before something actually needs to be transferred. This allows creating contexts on the host with non-default constructors, but it requires overloading the serialize method of the context. The actual time of transfer is not statically known; the earliest possible time is the first item- or tag-put. All changes to the context up to that point will be duplicated remotely, later changes will not.
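A self-contained sketch of the pattern (the tiny serializer here is a stand-in for CnC::serializer, and m_rate is a made-up piece of global read-only data; a real context would derive from CnC::context< my_context >):

```cpp
#include <cassert>
#include <vector>

// Minimal stand-in for CnC::serializer, only to keep this sketch
// self-contained; the real class packs/unpacks data for the network.
struct serializer
{
    std::vector< double > buf;
    serializer & operator&( double & d ) { buf.push_back( d ); return *this; }
};

// Sketch of a user context holding global read-only (DSA) data.
struct my_context
{
    double m_rate;   // set once by the environment, read-only in steps

    my_context( double rate = 0.0 ) : m_rate( rate ) {}

    // Marshal every member the remote step-codes will read; the
    // runtime invokes this to replicate the context on remote processes.
    void serialize( serializer & ser )
    {
        ser & m_rate;
    }
};
```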

Here is a simple example code which uses this feature: samples/blackscholes/blackscholes/blackscholes.h

Next: Beyond And With CnC
