From: Ramiro Willmersdorf <jblues@my-deja.com>
Newsgroups: comp.parallel
Subject: Unrolling murders performance
Date: 3 Jun 1999 16:06:43 GMT
Organization: Deja.com - Share what you know. Learn what you don't.
Approved: bigrigg@cs.cmu.edu
Message-Id: <7j696j$mno$1@goldenapple.srv.cs.cmu.edu>
Originator: bigrigg@ux6.sp.cs.cmu.edu
Xref: ukc comp.parallel:15644


Hi,

Is there a reason for an unrolled loop such as:

***************************************************************
      do 100 iter = 1, niter
*---!MIC$ DOALL PRIVATE(i, istep), SHARED(a,b,c,d)
         do 20 i = 1, (nstep-1)*NROLL+1, nroll
*     0
            c(i) = c(i) + a(i)*a(i) + b(i)*b(i)
            d(i) = c(i) + a(i)*b(i)
*     1
            c(i+1) = c(i+1) + a(i+1)*a(i+1) + b(i+1)*b(i+1)
            d(i+1) = c(i+1) + a(i+1)*b(i+1)
*     2
            c(i+2) = c(i+2) + a(i+2)*a(i+2) + b(i+2)*b(i+2)
            d(i+2) = c(i+2) + a(i+2)*b(i+2)
*     3
            c(i+3) = c(i+3) + a(i+3)*a(i+3) + b(i+3)*b(i+3)
            d(i+3) = c(i+3) + a(i+3)*b(i+3)
*     4
            c(i+4) = c(i+4) + a(i+4)*a(i+4) + b(i+4)*b(i+4)
            d(i+4) = c(i+4) + a(i+4)*b(i+4)

 20      continue

*     Remaining iterations
         do 30 i = nstep*nroll+1, BIG
            c(i) = c(i) + a(i)*a(i) + b(i)*b(i)
            d(i) = c(i) + a(i)*b(i)
 30      continue

 100  continue
***************************************************************

to run *four* times as slow as the original loop:


***************************************************************

      do 100 iter = 1, niter
*--- !MIC$ DOALL PRIVATE(i), SHARED(a,b,c,d)
           do 20 i = 1, BIG
                c(i) = c(i) + a(i)*a(i) + b(i)*b(i)
                d(i) = c(i) + a(i)*b(i)
   20      continue
  100 continue


***************************************************************

The loop just above is representative of the loop that uses
the most time in a Fortran program I'm trying to parallelize on
a Sun Enterprise 450 with 4 processors. ``BIG'' is very big :)
The machine runs Solaris 2.6 with Workshop Fortran 4.2.

Obviously, with such light loops and long vectors, when I try to
force parallelism the performance just sucks: it takes
twice as long to run. That's actually what I expected.

``Hey, no problem, I'll just unroll the loop, and eventually
it's *gotta* get enough load to make the startup costs low enough.''

The problem is that when I run the loop as unrolled above,
it takes about *four* times as long to run!  22 seconds
against 5 seconds, give or take.

I hardly expected that.  The unrolled loop even parallelizes
very well; however, it still takes longer than the sequential
loop since its baseline is so much slower.

The loops in the original program are actually somewhat
more complex, but I don't want to tackle them before I
understand what's going on here.
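
In case it helps anyone reproduce this, here is a minimal
self-contained sketch of the comparison. It is not the real
program: the sizes (big = 100003, niter = 3), the initial values,
and the definition nstep = big/nroll with nroll = 5 are all
assumptions chosen for illustration. It only checks that the
unrolled loop plus the remainder loop touch every element exactly
once and give the same c and d as the plain loop; it does no timing.

```fortran
      program unroll
*     Hypothetical sizes, chosen only for illustration.
      integer big, nroll, niter
      parameter (big = 100003, nroll = 5, niter = 3)
      double precision a(big), b(big)
      double precision c1(big), d1(big), c2(big), d2(big)
      integer i, iter, nstep, nbad

*     Assumed definition: number of complete unrolled blocks.
      nstep = big/nroll

*     Small integer values keep all the arithmetic exact.
      do 10 i = 1, big
         a(i) = dble(mod(i, 7))
         b(i) = dble(mod(i, 3))
         c1(i) = 0.0d0
         d1(i) = 0.0d0
         c2(i) = 0.0d0
         d2(i) = 0.0d0
 10   continue

*     Reference: the plain loop.
      do 100 iter = 1, niter
         do 20 i = 1, big
            c1(i) = c1(i) + a(i)*a(i) + b(i)*b(i)
            d1(i) = c1(i) + a(i)*b(i)
 20      continue
 100  continue

*     Unrolled by nroll = 5, followed by the remainder loop.
      do 200 iter = 1, niter
         do 30 i = 1, (nstep-1)*nroll+1, nroll
            c2(i) = c2(i) + a(i)*a(i) + b(i)*b(i)
            d2(i) = c2(i) + a(i)*b(i)
            c2(i+1) = c2(i+1) + a(i+1)*a(i+1) + b(i+1)*b(i+1)
            d2(i+1) = c2(i+1) + a(i+1)*b(i+1)
            c2(i+2) = c2(i+2) + a(i+2)*a(i+2) + b(i+2)*b(i+2)
            d2(i+2) = c2(i+2) + a(i+2)*b(i+2)
            c2(i+3) = c2(i+3) + a(i+3)*a(i+3) + b(i+3)*b(i+3)
            d2(i+3) = c2(i+3) + a(i+3)*b(i+3)
            c2(i+4) = c2(i+4) + a(i+4)*a(i+4) + b(i+4)*b(i+4)
            d2(i+4) = c2(i+4) + a(i+4)*b(i+4)
 30      continue
         do 40 i = nstep*nroll+1, big
            c2(i) = c2(i) + a(i)*a(i) + b(i)*b(i)
            d2(i) = c2(i) + a(i)*b(i)
 40      continue
 200  continue

*     Count the elements where the two versions disagree.
      nbad = 0
      do 50 i = 1, big
         if (c1(i) .ne. c2(i) .or. d1(i) .ne. d2(i)) nbad = nbad + 1
 50   continue
      print *, 'mismatches: ', nbad
      if (nbad .ne. 0) stop 1
      end
```

Compiled as fixed-form source, this should report zero mismatches.
Small integer values are used so that all the arithmetic is exact
and the two versions can be compared with .ne. rather than a
tolerance.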

I'd be very grateful for any insight,

Ramiro.



--
Articles to bigrigg+parallel@cs.cmu.edu (Admin: bigrigg@cs.cmu.edu)
Archive: http://www.hensa.ac.uk/parallel/internet/usenet/comp.parallel

