
Re: CommsTime times?



Dyke/Oyvind et al.,

> >   6. With the PAR version of delta, the communication loads are the same
> >      as for the SEQ delta, but there is the additional overhead of starting
> >      up and shutting down one new process per cycle.
> 
> In the hardware case, with the optimisations reported at CPA2000, the
> reverse is the case. A 2-component PAR is FASTER than a 2-component SEQ.
> At best, if both communications are ready to proceed, PAR is twice as fast
> as SEQ.
> 
> In general PAR is faster than SEQ - if PAR truly is parallel, but the current
> implementation may not quite achieve that for the particular case of a
> 3-component PAR/SEQ.
> 
> 	    Barry.

Don't forget, however, that Peter's commstime benchmark program runs
just one value around the cycle of three processes.  Thus, there are
not enough potential communications at any instant to keep truly
parallel hardware fully busy - on an FPGA we are not time-sharing a
common processor, as the other implementations being discussed do.

A long time ago, I experimented with various other non-standard
variations of the commstime benchmark that potentially produced more
communications to the 'consumer' process per clock cycle when run on an
FPGA:

* Inject two messages into the cycle at the start of 'prefix'
  [but this scenario is still limited by the number of processes in the
  cycle prepared to communicate at any time - now we have too few
  processes];

* Add extra buffering as well - provide two 'prefix' processes, resulting
  in two circulating packets and four processes to provide more parallel 
  slackness;

* Inject even more messages into an extended cycle;

* Run several parallel cycles, all feeding the same consumer process ...

... and so on.  Of course, none of these are then directly comparable with
Peter's figures, but some of his context-switch assumptions fall apart
when we use parallel hardware rather than a context-switching CPU
anyway!

My real message:  When using truly parallel hardware, there do appear
to be real performance benefits from creating very many tiny little
processes, and flooding their interconnects with lots of synchronising
messages.  The almost complete hiding of 2-way PAR and PAREND
overheads, as well as the overheads of small ALTs, and the relatively
low overheads of 3-way and 4-way PARs and PARENDs [on Xilinx target
architectures, at least] certainly make the use of lots of
double-buffering and excess parallelism very worthwhile when compiling 
Occam to hardware.  

As Barry says, our CPA2000 paper provides several examples.  There's a
copy at http://www.ee.surrey.ac.uk/Personal/R.Peel/logic.html .

Roger.


Dr. Roger M.A. Peel
Senior Lecturer in Digital Systems
Department of Computing
School of Electronic Engineering, Information Technology and Mathematics 
University of Surrey                   
Guildford                              Phone: +44 1483 879284 (01483 within UK)
Surrey  GU2 7XH                          Fax: +44 1483 876051
United Kingdom                         Email: R.Peel@xxxxxxxxxxxxxxxx