Question: does the T9000 have "out-of-order execution" or not?

Roger Shepherd replies:

Yes. I'll clarify this by telling you roughly how the T9000 processor works.
What follows is a first-order approximation to the T9000's implementation.

---------------START DESCRIPTION --------------------------------------

The T9000's CPU gets its speed from a number of sources. The most important
of these are:

     1) quad-banked unified main cache 
        - bank is determined by bits 4 and 5 of address
        - each bank is a fully associative, write-back cache with 4 words per line
        - cache may be disabled and used as RAM, or mixture of cache and RAM

     2) workspace cache
        - 32 words, allows 2 reads and 1 write per cycle, write through

     3) instruction grouper and execution pipeline

I hope points 1 and 2 are self-explanatory; I will concentrate on point 3.

The T9000 operates by "parsing" the instruction stream, forming sequences of
instructions into "groups" and then executing the groups on a five-stage
pipeline. A group can contain up to 8 instructions. (Although this suggests a
maximum performance of 8 ipc, the T9000 is fetch rate limited to 4 ipc. It
is, however, very useful to be able to issue at the higher rate since (i) some
groups will take more than 1 cycle to execute, and (ii) it is not always
possible to group/issue all the instructions which have been fetched.)

So, what can a group contain? Well, a group can contain (nearly) anything
that can be executed by the pipeline! So, what does the pipeline look like?
See below:

        1       2         3          4       5

      LOCAL  ADDRESS  MAIN-CACHE
                                  ALU/FPU  WRITE
      LOCAL  ADDRESS  MAIN-CACHE 

The first stage of the pipeline (the "local" stage) contains two parallel
function units each capable of performing operations such as "load local" or
"load constant". 

The second stage (the "addressing" stage) contains two parallel function
units each capable of performing the addressing arithmetic needed by
instructions such as "bsub", "wsub", "ldnl" etc. 
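As an illustration of the arithmetic this stage performs, here is a minimal
sketch in Python. The function names are made up for this note, and the code
models only the address calculation, not the stack behaviour of the real
instructions. It assumes 4-byte words, as on any 32-bit transputer.

```python
BYTES_PER_WORD = 4  # assumption: 32-bit words, 4 bytes each


def ldnl_address(base, offset):
    """Address accessed by 'ldnl offset' (or 'stnl offset') given a base pointer."""
    return base + BYTES_PER_WORD * offset


def wsub_address(base, index):
    """Address formed by 'wsub': word subscript, so the index is scaled by the word size."""
    return base + BYTES_PER_WORD * index


def bsub_address(base, index):
    """Address formed by 'bsub': byte subscript, so the index is not scaled."""
    return base + index


if __name__ == "__main__":
    # 'ldl c; ldnl 3' reads the word at c + 4*3
    print(hex(ldnl_address(0x1000, 3)))
    # 'ldl i; ldl a; wsub; stnl 0' writes the word at a + 4*i + 4*0
    print(hex(ldnl_address(wsub_address(0x2000, 5), 0)))
```

The second call chains the two helpers in the same way that a *wsub*
followed by *stnl 0* combines into a single address, a + 4*i + 4*0.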

The third stage ("main cache") permits two accesses to be made to the memory
system - these can proceed simultaneously if the data are contained in
different banks of the memory system.
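The bank rule can be sketched in a few lines of Python. Bank selection by
address bits 4 and 5 is as stated in point 1 above; the function names are
invented for this note, and treating any two accesses to the same bank as a
conflict is an assumption (the description only says that accesses to
different banks can proceed together).

```python
def bank_of(address):
    """Cache bank (0..3) selected by bits 4 and 5 of a byte address."""
    return (address >> 4) & 0b11


def can_access_together(addr_a, addr_b):
    """True if the two accesses fall in different banks.

    Assumption: two accesses to the same bank are always treated as a
    conflict, even if they touch the same cache line.
    """
    return bank_of(addr_a) != bank_of(addr_b)


if __name__ == "__main__":
    # With 4 words (16 bytes) per line, consecutive lines rotate through the banks.
    for addr in (0x00, 0x10, 0x20, 0x30, 0x40):
        print(hex(addr), "-> bank", bank_of(addr))
    print(can_access_together(0x1000, 0x1010))  # neighbouring lines
    print(can_access_together(0x100C, 0x2008))  # both map to bank 0
```

Since a line holds 16 bytes, bits 0 to 3 select the byte within a line, so
neighbouring lines always land in different banks.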

The fourth stage ("ALU/FPU") performs a single FPU or ALU (arithmetic)
operation.

Finally, the fifth stage, ("write") performs a write to the memory or a
conditional jump dependent on the result of an ALU/FPU operation.
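One way to read the five stages is as a resource budget for a group. The
Python sketch below is a deliberately crude model and not the real grouper:
the table and function names are invented for this note, and it covers only
instructions whose stage usage is spelled out in this description. *wsub*
and *bsub* are left out because the description does not say whether an
addressing chain such as wsub; stnl 0 occupies one addressing unit or two.

```python
from collections import Counter

# Function units available per stage, taken from the stage descriptions above.
STAGE_CAPACITY = {"local": 2, "address": 2, "cache": 2, "alu": 1, "write": 1}

# Stages each instruction occupies. Assumption: a simplified reading of the
# description, limited to instructions whose behaviour it states explicitly.
STAGES_USED = {
    "ldl": ["local"],
    "ldc": ["local"],
    "stl": ["local", "write"],
    "ldnl": ["address", "cache"],
    "stnl": ["address", "write"],
    "add": ["alu"],
}

MAX_GROUP_SIZE = 8


def stage_demand(group):
    """Count how many units of each stage a list of mnemonics would need."""
    demand = Counter()
    for op in group:
        demand.update(STAGES_USED[op])  # KeyError for unmodelled instructions
    return demand


def fits_in_one_group(group):
    """True if the group respects the size limit and every stage's capacity."""
    if len(group) > MAX_GROUP_SIZE:
        return False
    demand = stage_demand(group)
    return all(demand[stage] <= cap for stage, cap in STAGE_CAPACITY.items())


if __name__ == "__main__":
    print(fits_in_one_group(["ldl", "stl"]))
    print(fits_in_one_group(["ldl", "ldnl", "ldl", "stnl"]))
    print(fits_in_one_group(["ldl", "ldnl", "ldl", "ldnl", "add"]))
```

The real grouper applies more rules than unit counts (data dependencies, for
one), so a True result here means only that the stated capacities are not
exceeded.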

Maybe some examples might make this clearer:

1) a = b;           -- local variables

This compiles into ldl b; stl a, which forms a single group and executes in
one cycle. *ldl* is executed in the local stage, *stl* forms the address of
the local variable in the local stage and performs a write in the write stage
(remember the workspace cache is write through).

2) *a = *b;

This compiles into ldl b; ldnl 0; ldl a; stnl 0 which forms a single group
and executes in 1 cycle. In the first stage of the pipeline the two *ldl*
instructions are executed. In the second stage the addresses to be
dereferenced are computed (b + 4*0, a + 4*0). In the third stage the load
occurs, and in the final stage the store occurs.

3) *a = *b;           -- local variables
   *c = *d;

This compiles into     ldl b; ldnl 0; ldl a; stnl 0 -- G1
                       ldl d; ldnl 0; ldl c; stnl 0 -- G2

which forms 2 groups and executes in two cycles. The second load from the
main cache will occur before the first store has occurred (provided that a
and d point to different locations).

4) a[i] = b[2] + c[3];

This compiles to 

                   ldl c; ldnl 3; ldl b; ldnl 2; add;  -- G1
                   ldl i; ldl a; wsub; stnl 0          -- G2

and forms into two groups corresponding to the two lines. 

On cycle 1 of execution the two *ldl* instructions of G1 execute (note the
*ldl b* executes before the preceding *ldnl 3*).

On cycle 2 the address formation belonging to the two *ldnl* instructions 
(c + 4*3 and b + 4*2) of G1 execute, as do the two *ldl* instructions of G2.

On cycle 3 the two memory accesses (reads) corresponding to the two *ldnl*
instructions of G1 execute, as does the address formation caused by the
*wsub* and *stnl* instructions (a + 4*i + 4*0) of G2.

On cycle 4 the addition of G1 is performed.

On cycle 6 the result of the addition in G1 is written into the address
generated in G2.
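The cycle numbers in this walk-through follow from simple arithmetic: a
group issued on cycle n is in stage s on cycle n + s - 1. The Python sketch
below reproduces them. It assumes one group issued per cycle, one cycle per
stage and no stalls; the names are invented for this note.

```python
STAGES = ["local", "address", "main-cache", "ALU/FPU", "write"]


def stage_cycle(issue_cycle, stage):
    """Cycle on which a group issued at issue_cycle occupies the given stage.

    Assumption: every stage takes exactly one cycle and nothing stalls.
    """
    return issue_cycle + STAGES.index(stage)


def timeline(groups):
    """Return (cycle, group, stage, work) tuples sorted by cycle."""
    events = []
    for name, (issue_cycle, work) in groups.items():
        for stage, what in work:
            events.append((stage_cycle(issue_cycle, stage), name, stage, what))
    return sorted(events)


# Example 4: a[i] = b[2] + c[3], with G1 issued on cycle 1 and G2 on cycle 2.
EXAMPLE_4 = {
    "G1": (1, [("local", "ldl c, ldl b"),
               ("address", "c + 4*3, b + 4*2"),
               ("main-cache", "two reads"),
               ("ALU/FPU", "add")]),
    "G2": (2, [("local", "ldl i, ldl a"),
               ("address", "a + 4*i + 4*0"),
               ("write", "store result of add")]),
}

if __name__ == "__main__":
    for cycle, group, stage, what in timeline(EXAMPLE_4):
        print(f"cycle {cycle}: {group} {stage:<10} {what}")
```

Nothing is listed for cycle 5 because G2 has no ALU work of its own; its
write stage falls on cycle 6, one cycle behind where G1's would be. The same
arithmetic covers example 3: the load of G2 is in the main-cache stage on
cycle 4, ahead of the store of G1 in the write stage on cycle 5.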

5) 8 instructions in one group 

               ldl target; wsub; ldnl 0; ldl; ldnl 3; diff; eqc 0; cj

This could occur during the execution of something like

               if (target[j] == text[3]) something;


---------------- END DESCRIPTION ------------------------------------------

Well, does this machine have out-of-order execution? It certainly executes
instructions in a different order to that written by the programmer, and
anyone with an LSI attached to the external memory bus can detect this if
they try. With that comment I'll leave the matter.

The more important issue is "What do compilers have to do to exploit the
machine?". The T9000 executes naive transputer code pretty well - not
surprising given that that is what it was designed to do. There are certain
things a compiler should do to be sympathetic to the T9000, the most
important of which is that it should use addressing operations to do
addressing (e.g. it should use *bsub* to do byte subscription rather than
*sum*). Since the instruction timings of the T9000 are different from those
of the T805 an optimising compiler has to make different trade-offs. However,
unless the T9000 executes malicious code (e.g. no-ops interleaved into the
instruction stream) the grouper and pipeline work - normal code will run much
faster than on a T805 *of the same clock speed*.

--
Roger Shepherd, INMOS Ltd   JANET:    roger@uk.co.inmos 
1000 Aztec West             UUCP:     ukc!inmos!roger or uunet!inmos-c!roger
Almondsbury                 INTERNET: roger@inmos.com
+44 454 616616              ROW:      roger@inmos.com OR roger@inmos.co.uk