
The world needs process-orientation



All,

Interesting times.  Just as I find one mailing list discussing the
future of process-orientation (my preferred umbrella term for CSP and
similar ideas), I find another crying out for it.  Game Development
Algorithms is a list I was pointed to some time ago, when someone
(Richard - I forget his surname!) directed it at my C++CSP library as
an example of a different approach to concurrency (i.e.
process-orientation).  At the time I came close to veering off-topic
and the discussion petered out.  Fast-forward a few months, and
another discussion has ended up on the topic of concurrency, with the
usual problems bandied about - threads too heavyweight, too dangerous,
too hard to reason about - but no good solutions suggested.

Needless to say, I couldn't resist discussing process-orientation
again.  Games are an interesting area; historically they have been
concerned with performance above all else.  This has led to the recent
consoles having major parallelism inside - the Xbox 360 has 3 CPUs,
and the PS3 has the Cell.  So games developers will find themselves
forced to use concurrency, but are discovering the usual problems with
it.  The Sweeney talk (linked below in my other email) is very
interesting; I suggest you all read it if you are worried that what we
do is becoming less relevant rather than more - it is very much the
latter.

It seems that what we lack is not so much the technology (although
obviously the more process-oriented tools the better, especially those
offering an easy transition for the C/C++/Java developers who make up
most of the profession), but rather awareness of our work and of the
easier way it offers.  I know people such as Peter have been trying
for years to spread the word, and that, I think, is what we need.  But
I'm sure I am teaching you all to suck eggs!

Anyway, I include below my attempted plea to the game developers'
list, along with the thread that preceded it.  Developers are
beginning to realise that a better way is needed; we just need to
bridge the gap by allaying their scepticism and offering real tools
that solve their problems.  While I would love for them all to start
using occam or the like, perhaps (excusing my self-indulgence) a port
of C++CSP to the Cell and the 360 is more what is needed in the short
term.

Neil.



> -------- Original Message --------
> Subject: RE: [Algorithms] tyranny of amdahl's law...
> From: neil@xxxxxxxxxxxxxxxxx
> Date: Fri, June 09, 2006 9:57 am
> To: Game Development Algorithms
> <gdalgorithms-list@xxxxxxxxxxxxxxxxxxxxx>
> 
> I saw a mention of Sweeney's talk - that can be found here:
> http://www.st.cs.uni-sb.de/edu/seminare/2005/advanced-fp/docs/sweeny.pdf
> 
> It ties in well with some of the points discussed here.  Nicholas
> writes:
> 
> "The Windows performance team works hard to minimize the number of
> threads in the system. (Despite their efforts, there are 465 active
> threads in this system as I type, and 43 processes)... But, every
> thread uses an appreciable amount of resource - nonpaged pool, a kernel
> stack, etc. that reduces overall system performance."
> 
> I contend that rather than minimising the number of threads involved
> (which clearly isn't working), they should have devoted their efforts
> to reducing the amount of resource that each thread uses.  Threads
> are incredibly heavyweight, but they do not have to be constructed
> thus.  Tom points this out below: the cost of threads is down to
> Windows.  Although Linux is probably not a concern for many
> developers on this list, it commits the same crime.
> 
> Nicholas also writes:
> 
> "One can imagine new abstractions or thread sync mechanisms or other
> ways to structure the app to reduce this overhead.  But, maybe it
> should be addressed at the device level."
> 
> Whether you address it at the device level or try to make the best of
> what you have, what I think desperately needs to change is the
> language.  C and C++ are completely inadequate for describing
> concurrency.  Sweeney's talk concurs - he believes that functional
> languages may be the way out of the hole.  I may be stereotyping
> unfairly to say that some game developers are unlikely to consider
> anything other than C/C++ - whether because all their developers are
> masters of those languages, or for performance reasons.  The Unreal
> engine seems to contradict the latter point - a great portion of it
> is written in a bytecode language, yet it does not appear to suffer
> for it.  If the next equivalent of UnrealScript were to support
> concurrency, we might see concurrency put to much better use.
> 
> I believe that languages such as occam (http://frmb.org/occtutor.html)
> have a better philosophy on concurrency - treating it as natural and
> expected rather than as a last resort.  The latter viewpoint seems
> naturally to lead to concurrency mechanisms so ugly that you only
> ever want to use them as a last resort.
> 
> I hope this is of interest to some of you, and I think it is on-topic
> given the current discussion.  Concurrency is only becoming more
> prevalent (witness the 360's multi-CPU design and the PS3's Cell), so
> surely it is time to find a better way of dealing with it?
> 
> Thanks,
> 
> Neil.
> 
> 
> > -------- Original Message --------
> > Subject: Re: [Algorithms] tyranny of amdahl's law...
> > From: "Tom Forsyth" <tom.forsyth.gdalgo@xxxxxxxxxxxxxxxx>
> > Date: Fri, June 09, 2006 8:38 am
> > To: "'Game Development Algorithms'"
> > <gdalgorithms-list@xxxxxxxxxxxxxxxxxxxxx>
> > 
> > Creating threads is expensive because of the *software* (i.e. the OS). It's
> > nothing to do with the hardware - if you have control of the OS and accept
> > a few small restrictions, then thread-switches can be extremely cheap.
> > Indeed - this is what fibers were invented for. Real-time and embedded OSes
> > do this very well.
> > 
> > So the fact that GPUs can have lots of threads may be partly influenced by
> > hardware, but mainly it's the fact that their idea of a "thread" is far more
> > lightweight than what more conventional OSes and code consider to be a
> > thread. So this is an argument for lightening the OS, not really for
> > changing hardware.
> > 
> > Now, if you want to have the high-throughput vs low-latency discussion,
> > that's a moderately different kettle of fish - though it is certainly
> > related. But the fact that a Pentium 4 is out-of-order is not why threading
> > is slow on Windows.
> > 
> > 
> > TomF.
> > 
> > 
> > > -----Original Message-----
> > > From: gdalgorithms-list-bounces@xxxxxxxxxxxxxxxxxxxxx 
> > > [mailto:gdalgorithms-list-bounces@xxxxxxxxxxxxxxxxxxxxx] On 
> > > Behalf Of Nicholas Wilt
> > > Sent: 08 June 2006 18:40
> > > To: gdalgorithms-list@xxxxxxxxxxxxxxxxxxxxx
> > > Subject: Re: [Algorithms] tyranny of amdahl's law...
> > > 
> > > 
> > > It seems to me there are some fundamental problems with the
> > > programming models commonly used for multicore CPUs.  They don't
> > > scale well.  The usual structure of a multicore-aware application
> > > is an app thread with N worker threads, where N == the number of
> > > cores.  All N+1 threads must be created at app initialization
> > > time, since thread creation is expensive and can fail.  So the
> > > worker threads are suspended on creation (say by calling
> > > WaitForSingleObject), and when the app thread decides to do
> > > something parallelizable, it divides the work, calls SetEvent (or
> > > something) N times, and calls WaitForMultipleObjects (or
> > > something) to wait till the workers are done.  Now the worker
> > > threads each wake up, do their thing, and call SetEvent to
> > > handshake back to the app thread.  At least on Windows, all those
> > > thread sync calls do kernel transitions.  Expensive.
> > > 
> > > There's another layer of complexity added by the idea that most
> > > apps are composed of collections of libraries, often written by
> > > separate vendors.  Left to their own devices (i.e. without an
> > > application-arbitrated protocol for sharing thread resources), it
> > > seems to me these library vendors would tend to create N threads
> > > _per_library_ at initialization time.  Otherwise, whose thread do
> > > you signal when it's time to do a parallel task?  If you have 10
> > > multicore-aware libraries linked into an app, a quad-core system
> > > would have 40 mostly-idle threads sitting around in that process
> > > waiting to do some work for their library.  It may be possible for
> > > some app developers to define protocols for thread sharing, though
> > > there is a surprising amount of per-thread state that could make
> > > that problematic (thread-local storage, the FPCW and other control
> > > registers, etc.).
> > > 
> > > All this would be fine if idle threads were free, but they aren't.
> > > The Windows performance team works hard to minimize the number of
> > > threads in the system.  (Despite their efforts, there are 465
> > > active threads in this system as I type, and 43 processes.)
> > > Obviously the system would be unusable if almost all of those
> > > threads weren't sitting idle.  But every thread uses an
> > > appreciable amount of resource - nonpaged pool, a kernel stack,
> > > etc. - that reduces overall system performance.  I am sure there'd
> > > be a noticeable performance decrease if each of the 43 processes
> > > in my system had more threads sitting idle to support multicore.
> > > 
> > > These issues probably don't apply to consoles etc., where it's
> > > easier to negotiate a shared-threading protocol and they probably
> > > run in kernel mode all the time.  But they seem pretty serious on
> > > the prevalent PC platform.
> > > 
> > > 
> > > One can imagine new abstractions or thread sync mechanisms or
> > > other ways to structure the app to reduce this overhead.  But,
> > > maybe it should be addressed at the device level.  Maybe we need a
> > > new type of asymmetric coprocessor with many execution cores,
> > > designed for maximum throughput rather than minimum latency.
> > > Maybe this asymmetric coprocessor should treat execution threads
> > > more like yeast, creating them in huge numbers cheaply and easily
> > > to perform specific tasks and destroying them with impunity upon
> > > completion, than like oxen that are expensive to create and
> > > maintain.  Besides being better-architected to address
> > > parallelizable computations, such a device would have the
> > > additional advantage that it could overlap its execution with the
> > > CPU(s), proffering an additional level of parallelism.  Hey,
> > > waitasec...
> > > 
> > > 
> > > Message: 1
> > > Date: Thu, 8 Jun 2006 12:16:16 -0700
> > > From: "Jeff Roberts" <jeffr@xxxxxxxxxxxxxxxx>
> > > Subject: [Algorithms] tyranny of amdahl's law...
> > > To: <gdalgorithms-list@xxxxxxxxxxxxxxxxxxxxx>
> > > Message-ID: <009701c68b30$0a14e370$f2a8a8c0@xxxxxxxxxxxxxxxxxxxxxxx>
> > > Content-Type: text/plain; format=flowed; charset="iso-8859-1";
> > > 	reply-type=original
> > > 
> > > Posted this a couple weeks ago on a private list and was urged to
> > > post here too...
> > > 
> > > -----------
> > > So, this is kind of cool.  I just measured my new 16-way machine
> > > compressing some Bink videos, so that I could see how well I was
> > > scaling.  Here's the table:
> > > 
> > > 2 CPUs = 1.88 times as fast
> > > 3 CPUs = 2.68 times as fast
> > > 4 CPUs = 3.39 times as fast
> > > 8 CPUs = 5.59 times as fast
> > > 12 CPUs = 7.19 times as fast
> > > 16 CPUs = 8.39 times as fast
> > > 
> > > So, by the time I'm at 16 CPUs, my total increase is well off the
> > > 16x that you would hope for.  Damn, I must be memory bound or
> > > something, right?  I'm not getting 16x at all!!  How optimal am I,
> > > really?
> > > 
> > > Well, Amdahl's law (which is really just the law of diminishing
> > > returns with some terms rearranged) states that the *maximum
> > > speedup* that can be obtained with N processors is:
> > > 
> > > 1 / ( %_sequential_code + ( %_parallel_code / N ) ) = speed_up
> > > 
> > > So, a quick timing in Bink on one processor says that the
> > > compression and the motion compensation loops (which can be
> > > multi-threaded) take 94% of the time.  OK, man, my threading must
> > > be screwed, right?  Like if 94% of the time can be multi-threaded,
> > > then 94% of the time can be sped up by the 16 processors, so that
> > > should be close to 16x speed up, right?
> > > 
> > > Well, let's plug the numbers in for 16 CPUs:  16 CPUs, 6%
> > > sequential code, 94% parallel code:
> > > 
> > > 1 / ( 0.06 + ( 0.94 / 16 ) ) = 8.42
> > > 
> > > Shit!  That's almost exactly our observed time!  Now do the rest:
> > > 
> > > 2 CPUs: 1 / ( 0.06 + ( 0.94 / 2 ) ) = 1.89 (observed: 1.88)
> > > 3 CPUs: 1 / ( 0.06 + ( 0.94 / 3 ) ) = 2.68 (observed: 2.68)
> > > 4 CPUs: 1 / ( 0.06 + ( 0.94 / 4 ) ) = 3.39 (observed: 3.39)
> > > 8 CPUs: 1 / ( 0.06 + ( 0.94 / 8 ) ) = 5.63 (observed: 5.59)
> > > 12 CPUs: 1 / ( 0.06 + ( 0.94 / 12 ) ) = 7.23 (observed: 7.19)
> > > 16 CPUs: 1 / ( 0.06 + ( 0.94 / 16 ) ) = 8.42 (observed: 8.39)
> > > 
> > > So, at this point, I have to say that since I didn't really
> > > intuit Amdahl's equation, I was getting totally freaked out that
> > > the numbers were so close (the hairs on my neck were all standing
> > > up).  I was like, "how the hell did he know how fast my routine
> > > was going to be, the f*cker?"
> > > 
> > > But, it just makes sense, right?  You can only speed up the
> > > parallel parts of the program, so those parts get smaller and
> > > smaller, but the serial portion doesn't change, so it dominates
> > > the time as the parallel chunks get faster.
> > > 
> > > The only thing to intuit is that the sequential time dominates at
> > > the limit (which is obvious, but just doesn't seem like a big deal
> > > at first).  So, even if 99% of your application is parallel, and
> > > you had *infinite* processors, you can only get 100 times faster.
> > > Which makes sense, but still feels weird in my brain.
> > > 
> > > So, in my case, how'd I do?  Well, I'm scaling 100% perfectly
> > > with the number of CPUs - I can't possibly do any better with the
> > > existing architecture, so I'm not memory bound or anything - I'm
> > > just screwed by that tiny 6% that I can't parallelize.
> > > 
> > > Man, I hate that 6% now!  That measly little 6% drops me from a
> > > possible 16x speed up on my 16-way machine all the way down to
> > > 8.39x - which is just crazy.
> > > 
> > > The really scary thing is that video compression is massively
> > > easier to multithread than game code.  I feel bad for you guys.
> > > 
> > > ->Jeff
> > > RAD Game Tools
> > > 
> > > 
> > > _______________________________________________
> > > GDAlgorithms-list mailing list
> > > GDAlgorithms-list@xxxxxxxxxxxxxxxxxxxxx
> > > https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
> > > Archives:
> > > http://sourceforge.net/mailarchive/forum.php?forum_id=6188
> > > 
> > 
> > 
> > 