From: Paul Clark
Newsgroups: comp.sys.arm,comp.sys.transputer
Subject: Re: Floating Point Performance of the StrongARM
Date: Wed, 03 Feb 1999 17:20:31 +0000
Organization: Systems Magic Ltd.
Message-Id: <36B8855F.118C581D@sysmag.com>
References: <797adq$br7$1@nnrp1.dejanews.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Xref: ukc comp.sys.arm:3040 comp.sys.transputer:9019

Brian.Oneill@ntu.ac.uk wrote:

>                  Integer      Float
> StrongARM
> 2.11 compiler    0.366 sec    34.50 sec
> 2.50 compiler    0.366 sec     1.29 sec
                                 ^^^^
> J Browm's FP                  10.30 sec

Well, that certainly woke me from my lurking slumber!  I'm afraid I find
it very difficult to believe, though.  A float op in less than 4 integer
ops?

I'm afraid (having played with this myself many years ago), I think
you're being bitten by an optimiser...

> Below copy of our test code.

Excellent.  Time to pretend to be an optimiser.  I'm sure you've already
done this, but I'm still suspicious.  What I'd really like to see is the
ARM output...

> void BenchMark2(void)
> {
>     //section used for floating points op
>     long unsigned int i, j;
>     float p, q, k, l, m;
>     float ans[10];

Nothing volatile here, but I think you've ensured that the results are
all used by the two-stage accumulation into ans[0].  Fine.

>     Time();
>
>     j=0;
>
>     for (i=0;i<100000;i++)
>     {
>         p=4.0F;
>         q=200.0F;

Alarm bell - constants.  Better to pass these into the function from
outside and hope there isn't any inter-function optimisation.

>         //benchmarking starts here
>         for (j=0;j<10;j++)
>         {
>             p++;
>             q++;

Hmm.  So p=[5..14], q=[201..210]

>             k = p + q;

=> k = [206..224]

>             l = k*p;
>             m = l*q;

I can't work it out, but a compiler could, if it chose to unroll this
j[0..9] loop - the whole shebang would be constant folded down to
ans[j] =

>             ans[j] = k + l + m;

This is dangerous, because only the last iteration of the 100,000 is
actually significant, and it's the same calculation every time.
Better to accumulate in ans[j] (not forgetting to reset it at the
start), and make sure some part of the calculation involves 'i'.
However, since you're not getting a 100,000x speedup, this probably
isn't the problem.

>         }
>
>     Time();

Whoa!  Think code motion here.  What's to stop the compiler shifting
some or all of that calculation right >here<, since it's all on the
stack, and has no side effects that can affect Time()?  Indeed, it'd be
a cool thing to do, because it'll save ever having all the stack for
ans[] alive during a subcall.

I'd seriously consider including this final accumulation in the timing,
and setting a global with the answer to force it to calculate it before
calling Time().  The problem is, if the compiler _was_ doing this, you'd
probably be reading zero time!

>     i = 0;
>     for (i=0;i<10;i++)
>     {
>         ans[0]=ans[i] + ans[0];
>     }
>
>     writeHex(ans[1]);

Oops!  I think you meant ans[0] here.  This could be it, in fact.

You're only demanding that ans[1] is calculated, which would take one
tenth of the time needed to calculate ans[0..9], which is roughly what
you're seeing (at least compared to Julian's library).  If it decided to
unroll the j[0..9] loop, it would do this easily, because ans[0] and
ans[2..9] are never live, partly because of this typo, and partly
because you always reset them on each iteration.

>     Exit();

[Stage left ;-]

One way of solving most of these problems at a stroke is to put all the
variables into volatile globals.  That way it won't dare optimise
anything, but the base overhead of the calculation will be higher.

Hope this helps,

P.

--
Paul Clark           mailto:prc@sysmag.com     $ whois pc52
Systems Magic Ltd.   http://www.sysmag.com
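
A minimal sketch of how the suggestions above might be combined,
assuming a hosted C compiler with the standard clock() standing in for
the original Time() routine.  The names bench_float, time_bench,
p_start, q_start and result_sink are all hypothetical; none of them
appear in the original test code, and this is one possible arrangement
rather than a definitive fix.

```c
#include <time.h>

/* Hypothetical globals, not part of the original test code. */
volatile float p_start = 4.0F;    /* inputs are read through volatile so  */
volatile float q_start = 200.0F;  /* they cannot be constant-folded       */
volatile float result_sink;       /* storing here forces the final sum    */

/* Runs the inner 10-step calculation 'outer' times and returns the sum. */
float bench_float(unsigned long outer)
{
    float ans[10] = {0};          /* reset once, then accumulated into */
    float total = 0.0F;
    unsigned long i, j;

    for (i = 0; i < outer; i++) {
        /* Part of the calculation depends on 'i', so iterations differ. */
        float p = p_start + (float)(i & 3);
        float q = q_start;

        for (j = 0; j < 10; j++) {
            float k, l, m;
            p++;
            q++;
            k = p + q;
            l = k * p;
            m = l * q;
            ans[j] += k + l + m;  /* accumulate instead of overwrite */
        }
    }

    /* Final accumulation uses every element of ans[], not just one. */
    for (j = 0; j < 10; j++)
        total += ans[j];

    result_sink = total;          /* volatile store before timing stops */
    return total;
}

/* Times bench_float() with clock(); the final sum is inside the timing. */
double time_bench(unsigned long outer)
{
    clock_t t0 = clock();
    (void)bench_float(outer);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_bench(100000) would give a figure comparable in spirit to
the original loop count.  The volatile reads and the single volatile
store add a little overhead, but far less than making every variable
volatile, and it is still worth checking the generated ARM code to
confirm that the floating point work is really there.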