11th September 1995
Lecture room G22 (also known as the Pearson Lecture Theatre)
Pearson Building
University College London
Gower Street
London WC1E 6BT
Registration: 76.  Attendance: 70 (so 6 either did not show or arrived too late for our registration desk).
09:50 Introduction to the Day (Professor Peter Welch, University of Kent)
10:00 High performance compute + interconnect is not enough (Professor David May, University of Bristol)
10:40 Experiences with the Cray T3D, PowerGC, ... (Chris Jones, British Aerospace, Warton)
11:05 More experiences with the Cray T3D, ... (ABSTRACT) (Ian Turton, Centre for Computational Geography, University of Leeds)
11:30 Coffee
11:50 Experiences with the Meiko CS2, ... (ABSTRACT) (Chris Booth, Parallel Processing Section, DRA Malvern)
12:15 Problems of Parallelisation - why the pain? (ABSTRACT) (Dr. Steve Johnson, University of Greenwich)
13:00 Working Lunch (provided) [Separate discussion groups]
14:20 Language Problems and High Performance Computing (ABSTRACT) (Nick Maclaren, University of Cambridge Computer Laboratory)
14:50 Parallel software and parallel hardware - bridging the gap (ABSTRACT) (Professor Peter Welch, University of Kent)
15:30 Work sessions and Tea [Separate discussion groups]
16:30 Plenary discussion session
16:55 Summary
17:00 Close
These conclusions are summarised below through the answers reached in the final plenary session to the questions posed at the start of the workshop.
Yes. In some quarters this was severe enough to be causing serious economic embarrassment and a recommendation to think hard before moving into HPC (especially for non-traditional users - like geographers). Some felt that this disappointment partly results from over-selling by vendors, funders and local enthusiasts (e.g. "When the 40 Gflops (40 billion arithmetic operations per second), 256-processor Cray T3D is commissioned this Spring by the University's Computing Services it will be amongst the ten fastest computers in the world ...", from the "Supercomputer procurement - press release (3rd Feb 1994)"), which led to grossly raised expectations.
Yes. See below.
Yes. Supercomputers are a scarce resource (about two and a half machines in the UK are currently available to academics) and user queues are inevitable. Lower efficiency levels mean longer waiting times to get a job turned around - this is the real killer for users, over and above the actual execution time achieved.
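To make the arithmetic concrete, here is a minimal sketch (in Python; the peak rate and job size are assumptions chosen purely for illustration, not figures from the workshop) of how achieved efficiency feeds through into the machine-hours a job occupies - and hence into the queue behind it:

    # Illustrative only: the numbers are assumptions, not measurements from the workshop.
    peak_gflops = 40.0      # notional machine peak (e.g. a 256-processor MPP)
    work_gflop = 1.0e6      # total floating-point work in one job (a million GFLOP)

    def hours_to_run(efficiency):
        """Execution time in hours if the job sustains this fraction of peak."""
        sustained_gflops = peak_gflops * efficiency       # GFLOP/s actually delivered
        return work_gflop / sustained_gflops / 3600.0     # seconds converted to hours

    for eff in (0.40, 0.10):
        print(f"efficiency {eff:.0%}: {hours_to_run(eff):6.1f} machine-hours")

    # Dropping from 40% to 10% efficiency quadruples the machine-hours the job
    # occupies; on a saturated, queued machine the wait for a slot grows in
    # roughly the same proportion - the turn-around cost comes on top of the
    # longer execution time itself.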
Comparison against efficiency levels for PCs is irrelevant. PCs sit on your desk. We can afford to over-resource them so we have access to them immediately (a classic real-time constraint). If supercomputers were similarly available, no one would worry about efficiencies ... (except those wanting interactive response). There is one other difference. If you have a problem that is too large for a workstation on your desk, you can use a supercomputer. If you have a problem that is too large for a supercomputer, you can either wait for the next generation of supercomputer or improve your program's efficiency.
Efficiency also seems to be an easily measured benchmark against which funding agencies (EPSRC and Industry) are judging projects.
Having said this, there certainly exists a range of problems that can only be solved on the current MPP machines - for them, there is no choice but to live with the long turn-arounds and accept low efficiencies.
Parallel architecture had not picked up strongly enough on known problems (and their solutions) from the 1980s and had concentrated instead on peak MFLOP/s and MBYTE/s figures that users cannot attain in practice. Considerable frustration was expressed at those wasted opportunities - we should be doing much better than we are. This has to change.
Note that these conclusions are the reverse of the normally received wisdom (which says that parallel hardware is brilliant, but the parallel software infrastructure to support it lags behind and users' abilities to exploit what is on offer are weak). This workshop suggests that users have a natural affinity for the parallelism inherent in their applications, that sound and scalable models for expressing that parallelism exist, but that current parallel hardware lacks some crucial technical parameters that are necessary for the execution of those expressions at a worthwhile level of efficiency.
Nevertheless, it may be possible to influence such architecture (hardware and software) - in particular, latency and context switch times must move in line with computational performance and communications bandwidth. Machines designed from scratch - using the fastest commodity micro-cores for processor, memory, link and routing components and maintaining the correct balance as an overriding design constraint - would be easier to program, would yield high efficiencies and would come closer to being general purpose machines. Such machines would obtain huge leverage from well-behaved models of parallelism - not least through the automatic control of cache coherency, without the need for hardware or software run-time checks and remedies. It will be necessary to re-cast our application software to conform to those disciplines - for some, it will be necessary to re-write it. Failure to make such changes will leave HPC increasingly ineffective, which will be serious since the need for HPC looks set to increase. The technical knowledge to avoid this largely exists and can be developed considerably - if it is used, the future looks exciting and we can be optimistic.
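As a rough illustration of the balance point, a minimal sketch (Python; the 50 microsecond start-up cost and 100 MB/s bandwidth are assumed figures for illustration only) of the standard linear message-cost model t(n) = t0 + n/B shows why start-up latency, rather than raw bandwidth, dominates fine-grained communication:

    # Simple linear model of point-to-point message cost: t(n) = t0 + n / B.
    # All figures are illustrative assumptions, not measurements of any machine.
    t0_us = 50.0            # start-up latency per message, in microseconds
    bandwidth_mb_s = 100.0  # sustainable link bandwidth, in MB/s

    def effective_mb_per_s(n_bytes):
        """Bandwidth actually delivered to a message of n_bytes."""
        transfer_us = n_bytes / (bandwidth_mb_s * 1e6) * 1e6   # time on the wire
        total_us = t0_us + transfer_us
        return (n_bytes / 1e6) / (total_us / 1e6)

    for n in (1000, 10000, 100000, 1000000):
        print(f"{n:>8} byte message: {effective_mb_per_s(n):6.1f} MB/s delivered")

    # With a 50 us start-up cost, a 1 kB message sees about 17 MB/s of the
    # 100 MB/s link; only megabyte-sized messages approach the headline figure.
    # Doubling raw bandwidth without reducing the start-up cost does almost
    # nothing for fine-grained communication.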
On the political front, there is no immediate crisis but there is disappointment. Access to HPC facilities still exists, although at a lower scale than many had hoped.
At the engineering level, there is a crisis. There has been little or no progress in MPP architecture over the past 5 years as manufacturers and their clients have pursued obvious goals (MFLOP/s and MBYTE/s) and not emphasised the twiddly bits (low startup latencies and context switches, portable and scalable models of parallelism, prevention of cache incoherency, ...) that are necessary to make them work properly.
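The context-switch point can be illustrated in the same spirit: a minimal sketch (Python; the grain and switch times are assumed figures, not measurements of any machine) of the fraction of processor time lost when a process is descheduled at each communication point:

    # Illustrative assumptions only: no figures here are measurements of any machine.
    def overhead_fraction(grain_us, switch_us):
        """Fraction of processor time lost if every grain of useful work
        of length grain_us is followed by a context switch of length switch_us."""
        return switch_us / (grain_us + switch_us)

    for switch_us in (100.0, 1.0):              # heavyweight vs. lightweight scheduling
        for grain_us in (10.0, 100.0, 1000.0):  # fine to coarse computation grains
            lost = overhead_fraction(grain_us, switch_us)
            print(f"grain {grain_us:6.0f} us, switch {switch_us:5.1f} us: "
                  f"{lost:6.1%} of processor time lost")

    # A 100 us switch makes 10 us grains hopeless (over 90% of time lost), forcing
    # coarse-grained, hand-balanced code; a 1 us switch keeps the loss to about 9%
    # even at that grain, leaving the choice of grain to the application.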
The result of this lack of progress is real difficulty even for experienced users, the scale of whose applications gives them no choice but to accept the machines and live with the long turn-arounds. New users are discouraged from entering the fray, especially those from non-traditional HPC fields of application.
Herein lies the basis for a real political crisis that may soon be upon us. If the engineering problems are not resolved in the near future, pressure will build to close down (or, at least, not upgrade) existing HPC facilities - pressure that may be difficult to resist. Such pressure is already being felt in the USA, and it is not a comfortable feeling.
Educate, research, develop, publish and influence.
Teach high-level models of parallelism, independent of target architecture. Teach and research good models that scale, are efficient and can extract much more of the parallelism in the users' applications. Priorities are: correctness, efficiency and scalability, portability, reliability and clarity of expression. Maintenance of existing vector/serial codes is not relevant for the long term.
Be fundamental - don't be afraid to question the existing consensus, whether this be HPF, MPI, FP, CSP, BSP or whatever. Do not set up a single `centre of excellence' for the provision and dissemination of training and education in HPC.
Listen (and get manufacturers and funding organisations to listen) to real users. Don't go for raw performance (e.g. 1.8 TFLOP/s). Demand to know what will really be sustained and publish the answers and the reasons behind them. Do some real computer science on the performance and programmability of `grand challenge' machines (which may be difficult in the UK as the funding bodies for HPC and computer science seem to be entirely separate).
Don't necessarily expect to provide efficient HPC solutions for all problems that need them - some badly behaved ones may need to wait (these need to be characterised). Look to the embedded and consumer market for the base technologies of the future (e.g. video-on-demand servers and their supporting communications and switching) - influence and modify them to the special needs of HPC applications.
Don't just accept what is on offer from today's HPC - the hardware may have to be accepted, but the software access to it could be considerably improved.
Don't do nothing!
Review progress in 12 months - another workshop? Meanwhile, work through suitable Internet newsgroups.
Disseminate results and concerns through newsgroups and archives (e.g. the Internet Parallel Computing Archive at <URL:/parallel/groups/selhpc/crisis/>).
Move the discussion beyond the UK.