Question: does the T9000 have "out-of-order execution" or not?

Roger Shepherd replies:

Yes. I'll clarify this. What I'll do is tell you roughly how the T9000
processor works - what follows is a first order approximation to the
T9000's implementation.

---------------START DESCRIPTION --------------------------------------

The T9000's CPU gets its speed from a number of sources. The most
important of these are:

1) quad-banked unified main cache
   - bank is determined by bits 4 and 5 of address
   - each bank is a fully associative, 4 words per line, write-back cache
   - cache may be disabled and used as RAM, or as a mixture of cache and RAM

2) workspace cache
   - 32 words, allows 2 reads and 1 write per cycle, write through

3) instruction grouper and execution pipeline

I hope points 1 and 2 are self-explanatory; I will concentrate on point 3.

The T9000 operates by "parsing" the instruction stream, forming sequences
of instructions into "groups" and then executing the groups on a five
stage pipeline. A group can contain up to 8 instructions. (Although this
suggests a maximum performance of 8 ipc, the T9000 is fetch rate limited
to 4 ipc - it is, however, very useful to be able to issue at the above
rate since (i) some groups will take more than 1 cycle to execute, and
(ii) it is not always possible to group/issue all the instructions which
have been fetched.)

So, what can a group contain? Well, a group can contain (nearly) anything
that can be executed by the pipeline! So, what does the pipeline look
like? See below:

        1          2            3             4          5

      LOCAL     ADDRESS     MAIN-CACHE
                                           ALU/FPU     WRITE
      LOCAL     ADDRESS     MAIN-CACHE

The first stage of the pipeline (the "local" stage) contains two parallel
function units each capable of performing operations such as "load local"
or "load constant". The second stage (the "addressing" stage) contains two
parallel function units each capable of performing the addressing
arithmetic needed by instructions such as "bsub", "wsub", "ldnl" etc.
The third stage ("main cache") permits two accesses to be made to the
memory system - these can proceed simultaneously if the data are
contained in different banks of the memory system. The fourth stage
("ALU/FPU") performs a single FPU or ALU (arithmetic) operation. Finally,
the fifth stage ("write") performs a write to the memory or a conditional
jump dependent on the result of an ALU/FPU operation.

Maybe some examples might make this clearer:

1)      a = b;          -- local variables

This compiles into

        ldl b; stl a

which forms a single group and executes in one cycle. *ldl* is executed
in the local stage, *stl* forms the address of the local variable in the
local stage and performs a write in the write stage (remember the
workspace cache is write through).

2)      *a = *b

This compiles into

        ldl b; ldnl 0; ldl a; stnl 0

which forms a single group and executes in 1 cycle. In the first stage of
the pipeline the two *ldl* instructions are executed. In the second stage
the addresses to be dereferenced are computed (b + 4*0, a + 4*0). In the
third stage the load occurs, and in the final stage the store occurs.

3)      *a = *b;        -- local variables
        *c = *d;

This compiles into

        ldl b; ldnl 0; ldl a; stnl 0    -- G1
        ldl d; ldnl 0; ldl c; stnl 0    -- G2

which forms 2 groups and executes in two cycles. The second load from the
main cache will occur before the first store has occurred (provided that
a and d point to different locations).

4)      a[i] = b[2] + c[3];

This compiles to

        ldl c; ldnl 3; ldl b; ldnl 2; add;      -- G1
        ldl i; ldl a; wsub; stnl 0              -- G2

and forms into two groups corresponding to the two lines. On cycle 1 of
execution the two *ldl* instructions of G1 execute (note that *ldl b*
executes before the preceding *ldnl 3*). On cycle 2 the address formation
belonging to the two *ldnl* instructions (c + 4*3 and b + 4*2) of G1
executes, as do the two *ldl* instructions of G2.
On cycle 3 the two memory accesses (reads) corresponding to the two
*ldnl* instructions of G1 execute, as does the address formation caused
by the *wsub* and *stnl* instructions (a + 4*i + 4*0) of G2. On cycle 4
the addition of G1 is performed. On cycle 6 the result of the addition in
G1 is written into the address generated in G2.

5) 8 instructions in one group

        ldl target; wsub; ldnl 0; ldl; ldnl 3; diff; eqc 0; cj

This could occur during the execution of something like

        if (target[j] = text[3]) something;

---------------- END DESCRIPTION ------------------------------------------

Well, does this machine have out-of-order execution? It certainly
executes instructions in a different order to that written by the
programmer, and anyone with an LSI attached to the external memory bus
can detect this if they try. With that comment I'll leave the matter.

The more important issue is "What do compilers have to do to exploit the
machine?". The T9000 executes naive transputer code pretty well - not
surprising given that that is what it was designed to do. There are
certain things a compiler should do to be sympathetic to the T9000, the
most important of which is that it should use addressing operations to do
addressing (e.g. it should use *bsub* to do byte subscription rather than
*sum*). Since the instruction timings of the T9000 are different from
those of the T805, an optimising compiler has to make different
trade-offs. However, unless the T9000 executes malicious code (e.g.
no-ops interleaved into the instruction stream) the grouper and pipeline
work - normal code will run much faster than on a T805 *of the same clock
speed*.

--
Roger Shepherd, INMOS Ltd   JANET:    roger@uk.co.inmos
1000 Aztec West             UUCP:     ukc!inmos!roger or uunet!inmos-c!roger
Almondsbury                 INTERNET: roger@inmos.com
+44 454 616616              ROW:      roger@inmos.com OR roger@inmos.co.uk
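
For readers who want to check the cycle counts in example 4, here is a
rough Python sketch of the timing model described above. It assumes that
one group issues per cycle and that every group spends exactly one cycle
in each of the five stages, which is only the "first order approximation"
of the description. The names stage_schedule, bank_of and
can_access_together are made up for illustration and are not INMOS
terminology. bank_of uses bits 4 and 5 of the address as stated in point
1; treating two accesses to the same bank as unable to proceed together
is what the description implies rather than something it states.

```python
# Illustrative model only: stage names come from the description,
# everything else (function names, one-cycle-per-stage timing) is assumed.
STAGES = ["local", "address", "main-cache", "alu/fpu", "write"]


def stage_schedule(num_groups):
    """Map cycle -> list of (group, stage), assuming one group issues per
    cycle and each group spends exactly one cycle in every stage."""
    schedule = {}
    for g in range(num_groups):
        for s, stage in enumerate(STAGES):
            cycle = g + s + 1  # group g enters the local stage on cycle g+1
            schedule.setdefault(cycle, []).append((f"G{g + 1}", stage))
    return schedule


def bank_of(address):
    """Main cache bank selected by bits 4 and 5 of the address."""
    return (address >> 4) & 0b11


def can_access_together(addr_a, addr_b):
    """Assumed rule: two main-cache accesses proceed in the same cycle
    only when they fall in different banks."""
    return bank_of(addr_a) != bank_of(addr_b)


if __name__ == "__main__":
    # Two groups, as in example 4 (G1 and G2).
    for cycle, work in sorted(stage_schedule(2).items()):
        print(cycle, work)
```

With two groups this puts G1's ALU/FPU stage on cycle 4 and G2's write
stage on cycle 6, matching the walk-through of example 4. Groups that
take more than one cycle to execute, and bank conflicts, are not modelled
in the schedule.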