CnC: Tag-Ranges

Often the natural granularity of a step is rather small in the sense of having only low compute load. As the CnC convenience of programming with CnC (e.g. all the automatic parallelism) apparently comes with some overhead, this might not always lead to an efficient program. The CnC C++ API comes with a possibility to prescribe a bunch of tags at once. This feature is closely related to TBB's range concept. Like in TBB, the put range will be subject to recursive splitting/partitioning. Instead of scheduling individual tags, the runtime will then schedule entire ranges and so avoid some overhead induced by the scheduler. Major differences to TBB's range concept are firstly that CnC doesn't really treat it as a consecutive range but only as an iteratable set, and secondly that the interface to steps stays at the granularity of individual tags. There is only one interface to steps, e.g. you still define the execute method on tags (and not on tag-ranges) which allows for mixing puts of tags and tag-ranges within the same program.

In the primes example (/samples/primes/primes) we actually put a series of tags to trigger the graph evaluation. Let's use tag-ranges instead; it will give the runtime more freedom for optimization and some might even think it's nicer code. Primarily, tag-ranges as seen as a tuning capability so most of the functionality is within the tuning interface, e.g. in the tag-tuner.

First of all we have to declare the type of range we are going to use. All we need is a blocked range of integers, so let's just use TBB's blocked range. The range-type is declared through a tuner; the tag-tuner interface accepts the range type as its first template argument. As we are not going to provide any custom code, we can just use the CnC tuner-class directly:

    CnC::tag_collection< int, CnC::tag_tuner< CnC::Internal::strided_range< int > > > m_tags;

Now we can put the range instead of writing a for-loop (in main):

    c.m_tags.put_range( CnC::Internal::strided_range< int >( 3, n, 2 ) );

If you run and compile this code (samples/primes/primes_range/primes_range.cpp) you should observe from the scheduler statistics output that the number of scheduled steps is significantly smaller than the number of checked values. Please note that for this algorithm a more sophisticated partitioning strategy would be needed to achieve good scalability. The default partitioning creates equally sized ranges, which will lead to heavier partitions if the tag-values become bigger. Please refer to Advanced Range-Features for more details on the partitioner interface.

CnC::parallel_for

The above example checks twice as many values necessary, as it does execute the step also for even numbers. To avoid this, you can define your own range, which could for example use strides or list odd numbers in any other manner (see Advanced Range-Features). Alternatively you can use the convenient parallel_for construct. Besides avoiding unnecessary work, it even simplifies the code: You don't need a tag-collection and instead of prescribing a step you simply apply the step to parallel_for:

    CnC::parallel_for( 3, n+1, 2, FindPrimes(), CnC::pfor_tuner< false >() );

The last argument "false" is a tuning parameter (see Tuning Ranges) which tells the runtime that the step-code will not get any items (e.g. is independent of any other steps or items).See samples/primes/primes_parfor/primes_parfor.cpp for the full code.

Attention:: In distributed mode, parallel_for will not distribute the given work across processes even if the tuner provides a matching compute_on function. All implied steps-instances will always be scheduled locally.

Tuning Ranges

If you do use ranges or CnC::parallel_for and the step-code does not consume items, you can bypass overhead which is needed to handle data dependencies. The optional template parameter to step_tuner allows this in a very simple manner by setting it to "false":

  struct my_step_tuner : public CnC::step_tuner< false >
  { ... }

A even more simplified version of this parameter is used in CnC::parallel_for and CnC::context::parallel_for, e.g in this example: CnC::parallel_for.

As mentioned in Tag-Ranges the tuner also provides the partitioner. Tag ranges are a matter of tag-collections, so the providing a custom partitioner is done through assigning a tuner to the tag-collection. All you need to do is provide the tuner as the second optional argument to tag_tuner (note that the tuner is needed anyway for ranges):

  struct my_tag_tuner : public CnC::tag_tuner< my_range_type, my_partitioner_type >
  { ... }

Advanced Range-Features

Aspects of Tag-Ranges The implemented range-mechanism is very generic. It allows you to provide an arbitrary range as long as you also provide a partitioner which can handle the partitioning of your range (which might be anything, a tree, a matrix etc.). For a detailed description on the requirements on ranges and partitioners please refer to CnC::tag_collection::put_range and CnC::default_partitioner.

It is possible to let a step work on ranges rather than on individual tags. All you need to do is use the range for both, the tag-type and the range-type. To let the runtime use recursive partitioning you also need to provide a partitioner. The API comes with a pre-defined partitioner for this: CnC::tag_partitioner which is otherwise identical to the CnC::default_partitioner. Note: if your tag-type is a range, using CnC::tag_collection::put will not partition the range, only CnC::tag_collection::put_range will do so.

Partitioners are declared through providing a tuner at step-prescription time. Please read CnC::tag_tuner, CnC::default_partitioner and The Tuners for more details.

Next: Re-using CnC graphs (reductions, cross/join...)