The English Electric (EE) KDF9 was one of the most successful products of the early UK computer industry, and it continues to inspire great affection in former users. Developed as a second-generation successor to the DEUCE, and announced in 1960, KDF9 came into service in 1964 and was still in use in 1980 at the National Physical Laboratory. KDF9 offered about a third of the power of a Ferranti Atlas for about an eighth of the cost. (So much for “Grosch’s Law”!) It was also much more successful than Atlas in the marketplace, selling about 30 systems against a mere handful of the more famous Ferranti machines. A large KDF9 configuration was priced around £250,000 — the equivalent of £2,000,000 nowadays — and was about 200,000 times slower than the £2000 laptop computer on which this web page was edited. This increase in cost-effectiveness by a factor of 200,000,000 is the cumulative result of the exponential improvements that the semiconductor industry has delivered for the last 50 years, and represents a halving of the computer price/performance ratio every 20 months.
This paper — The English Electric KDF9 — gives a comprehensive overview of the KDF9 architecture, including software; and this — The Hardware of the KDF9 — is an extended account of the hardware, with tables defining the instruction set and order code. They are the first two instalments of a trilogy; I will add the final part — The Software of the KDF9 — in due course.
There are links to various primary source documents in the following. Some are in PDF format, but most are in the much more space-efficient DjVu format; DjVu was preferred whenever the equivalent PDF would be more than about 5MB in size. If you are unable to view documents in DjVu format you may be able to find equivalent (but much bigger) PDFs in this collection.
The KDF9 was one of the earliest fully hardware-secured multiprogramming systems. The Timesharing Director, its elegantly simple operating system, could run up to four programs concurrently, each being confined to its own core area by hardware relocation. A program had its own sets of working registers, among them an expression-evaluation stack, or NEST, and a subroutine-link stack, or SJNS, which were activated when that program was dispatched, so that context switching was efficient. A program could drive hardware I/O devices directly, but was limited by hardware checks to those that the Director had allocated to it. Any attempt to use an unallocated device, or to access unallocated core store, caused an error interrupt.
There was hardware-enforced mutual exclusion of access to I/O buffers. When a program blocked for that reason, or by voluntarily waiting for an I/O transfer to terminate, it was interrupted and Director switched to the program of highest priority that was not itself blocked. When a blockage cleared, and the I/O Control unit recognised that the newly unblocked program was of higher priority than the program currently running, it interrupted to allow an immediate context switch. There was even provision in the hardware to deal with priority inversion, whereby a high-priority program, say H, is forced to wait by a program of lower priority which has control of an I/O device that H needs to use.
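The dispatching rule just described can be sketched as follows (a minimal illustration in Python, not Director's actual logic; the priority encoding, with 0 as highest, is an assumption):

```python
# An illustrative sketch (not Director's code) of the dispatching rule:
# run the highest-priority program that is not blocked, and re-dispatch
# as soon as a blockage clears for a program of higher priority.

def dispatch(programs):
    """Return the runnable program of highest priority (0 = highest here),
    or None if every program is blocked."""
    runnable = [p for p in programs if not p["blocked"]]
    return min(runnable, key=lambda p: p["priority"]) if runnable else None

progs = [
    {"name": "P0", "priority": 0, "blocked": True},    # waiting on an I/O buffer
    {"name": "P1", "priority": 1, "blocked": False},
    {"name": "P2", "priority": 2, "blocked": False},
]
print(dispatch(progs)["name"])    # P1: runs while higher-priority P0 is blocked

progs[0]["blocked"] = False       # P0's transfer terminates: interrupt raised
print(dispatch(progs)["name"])    # P0: immediate context switch
```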
Later KDF9 operating systems included multi-access features, usually with PDP–8 front ends to handle the interactive terminals. They included Eldon 2, at the University of Leeds; DEMOCRAT, at the National Physical Laboratory (NPL); EGDON, developed primarily by English Electric with assistance from the UK Atomic Energy Authority (UKAEA); and EGDON3/COTAN, developed primarily at UKAEA Culham Laboratories (the UK’s principal centre for fusion reactor research), with assistance from several KDF9-owning Universities.
G.O.L.D. — Glasgow On-Line Desks — was an experimental CRJE facility based on the EE Timesharing Director, developed during 1966-67 at Glasgow University. Never put into production, it was quickly superseded by COTAN. In some ways it can be seen as a primitive application of ideas more successfully deployed by Eldon 2. The G.O.L.D. manual is a mish-mash of user guide, reference manual, and implementation description: the concept of levels of abstraction had yet to make its mark. It is amusing to see that a chapter entitled “FOR BEGINNERS” starts with a quotation in ancient Greek that present-day Athenians struggle to construe — it means something like: practical experience trumps book-learning. And they claim that Universities have not dumbed down!
G.O.L.D. casts light on some of the difficulties that the designers of KDF9 software had to overcome. Chapter 6 of the manual describes the conventions used for representing the KDF9 character set in terms of the ASCII-based KSR-33 Teletype. It’s not a pretty picture — see the ‘simple’ example on pages 2-1 and 2-2. Chapter 10 is very valuable for its account of how to use PROMPT, the disc-oriented program development system provided by English Electric, upon which Eldon 2 also was based.
The two KDF9s operated by the UKAEA did not have the timesharing hardware described above (it was an extra-cost option), and so a more basic approach to efficient machine utilisation was adopted. EGDON was a classical FORTRAN job-stream system, of the same general type as IBSYS for the IBM 7090. Many of its features reflected a desire for some compatibility with the FORTRAN compilers written by the Atomic Weapons Research Establishment at Aldermaston for their IBM 7030 ‘Stretch’ computer.
A paper describing the original version of EGDON has been anthologised by Per Brinch Hansen in his book: Classic Operating Systems: From Batch Processing to Distributed Systems, Springer, 2001 (also available for download). It presents EGDON essentially as an automated program development system. Part 1 of the EGDON manual reflects this emphasis on programming. Chapter 9 describes the (non-interactive) command language; Chapter 10 contains a complete example of a small job that mixes modules in FORTRAN and Usercode. A later version of the system, EGDON 3, supported a disc-based filing system and provided the infrastructure for COTAN. At that point EGDON had clearly graduated as a fully-fledged OS.
Part 2 of the EGDON manual is devoted to operational and system-generation procedures. Reading this, I am struck by the thought that the implementors of COTAN achieved Iron Age results with Stone Age tools.
COTAN — the Culham Online Task Activation Network — an extension of EGDON, was one of the earliest fully-interactive multi-access systems. (Its prototype was COSEC — the Culham Online Single Experimental Console — described in a paper available for download.) The COTAN manual describes its teletype-oriented user interface (which I think was excellent for the time), and the structure of the file system. The COTAN command reference manual specifies the commands available to the online user, including the interactive COTAN variant of Whetstone Algol. A later release refined these facilities greatly, in particular making the file system much easier to use, and providing the facility to run compiled binary programs from a terminal.
The KDF9 Whetstone and Kidsgrove compilers were pioneering implementations of Algol 60.
Brian Wichmann, at NPL, modified the Whetstone interpreter to gather byte code execution-frequency statistics for a large number of programs. This data was used to synthesise the famous Whetstone Benchmark, a program that reproduces these statistics in a single run, permitting its use as a simple tool for estimating computer performance in scientific applications.
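The synthesis idea can be sketched as follows (a toy Python illustration; every number in it is invented, not taken from Wichmann's data):

```python
# A toy illustration of benchmark synthesis: given measured shares of
# execution frequency for each kind of statement, choose loop counts so
# that a single run of the synthetic program reproduces the mix.
# All figures below are invented for the sake of the example.

target = 1_000_000        # total weight to reproduce, in 'instructions'
shares = {"array": 0.120, "branch": 0.345, "integer": 0.210, "trig": 0.032}
cost   = {"array": 10,    "branch": 4,     "integer": 6,     "trig": 20}
                          # assumed cost of one iteration of each module

loops = {m: round(target * shares[m] / cost[m]) for m in shares}
print(loops["branch"])    # 86250 iterations of the conditional-jump module
```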
A variant of FORTRAN IV, called EGTRAN, was implemented on the EGDON system. Its most unusual feature was that a FUNCTION or SUBROUTINE subprogram could be declared RECURSIVE. In that case its local variables were held in a stack rather than static storage, permitting the subprogram to be invoked while it was already active.
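Why stack allocation matters can be sketched in Python (illustrative only, not EGTRAN): with a single statically allocated copy of each local, an inner activation clobbers the outer one.

```python
# Illustrative only: why RECURSIVE subprograms need stack-allocated locals.
# The 'static' version keeps its one local in a shared cell, so the
# recursive call overwrites the caller's copy of n.

static_n = {"value": 0}                # the single, statically allocated 'n'

def fact_static(n):
    static_n["value"] = n              # local held in static storage
    if static_n["value"] <= 1:
        return 1
    sub = fact_static(static_n["value"] - 1)
    return static_n["value"] * sub     # 'n' was clobbered by the recursion!

def fact_stack(n):                     # ordinary locals: one copy per call
    return 1 if n <= 1 else n * fact_stack(n - 1)

print(fact_static(5), fact_stack(5))   # 1 120 -- static storage gives 1
```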
FORTRAN was also available under Eldon 2, with a compiler written at Leeds University; and there were compilers for various “autocodes”: simple algebraic languages derived from compilers first written by Tony Brooker for the Ferranti Pegasus, Sirius, Mercury and Atlas computers. A later version of autocode on the KDF9, known as IMP, and written at the Edinburgh Regional Computer Centre (ERCC), was a very capable language used for system programming. COBOL never made it to the KDF9, but EE supplied a Report Program Generator (RPG). Bill Waite implemented his STAGE2 macroprocessor, the basis for the Mobile Programming System, on KDF9. The KDF9 STAGE2 manual contains an example of an EGDON job deck, and a short COTAN interactive dialogue.
This simple program gives a flavour of Usercode, the rather eccentric assembly language of the KDF9; it calculates Ackermann’s function and types the result on the Flexowriter. A major engineering application program in Usercode was written by Phil Runciman. Here it is.
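For reference, the function itself, in the textbook Ackermann-Péter form, can be sketched in Python:

```python
# The Ackermann-Peter function, as commonly defined in textbooks
# (a reference definition in Python, not a transcription of the Usercode).

def ackermann(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

print(ackermann(3, 3))   # 61
```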
The KDF9 (Usercode) Programming Manual is the best available guide to programming the KDF9. (An earlier, much less comprehensive version of the Usercode manual is available elsewhere on the WWW. The present version is believed to be definitive.)
The STAGE2 macroprocessor uses an unusual format for macro templates, mixing keywords and parameters arbitrarily. It is possible that this derives from the structure of Usercode, as Waite actually wrote a Usercode assembler that ran much faster than EE’s original (Note on rapid instruction analysis by table lookup, M. O’Halloran and W. M. Waite; The Computer Journal, 1966, pp.248-249).
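The flavour of such mixed keyword-and-parameter templates can be sketched in Python (a minimal matcher; "'" is assumed here as the parameter flag, and real STAGE2 matching is more elaborate):

```python
# A sketch of STAGE2-style template matching (illustrative only).
# Each "'" in the template stands for a parameter; keywords and
# parameters may be interleaved arbitrarily.

def match(template, line):
    """Return the captured parameter strings if 'line' fits 'template',
    else None.  Parameters are matched shortest-first, with backtracking."""
    def solve(ti, li, captured):
        if ti == len(template):
            return captured if li == len(line) else None
        if template[ti] == "'":
            for end in range(li, len(line) + 1):   # try every extent
                result = solve(ti + 1, end, captured + [line[li:end]])
                if result is not None:
                    return result
            return None
        if li < len(line) and line[li] == template[ti]:
            return solve(ti + 1, li + 1, captured)
        return None
    return solve(0, 0, [])

print(match("ADD ' TO '", "ADD X TO TOTAL"))   # ['X', 'TOTAL']
```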
For use with the GNU KDF9 emulator, ee9, there is now kal3, a modern Usercode assembler written by David Holdsworth. It runs natively on your own computer and generates a KDF9 machine code program that ee9 can load and run. An executable of kal3 is included with the download package for ee9. The original source code can be found some way down this page.
ee9, a KDF9 emulator, has been developed with the ultimate objective of being able to run KDF9 object programs under the control of an original operating system. It is also intended as lasting and accessible documentation of the KDF9 architecture. The touchstone of its success has been to run the Whetstone Benchmark on its original platform: the Whetstone Algol programming system for the KDF9. According to the emulator’s virtual CPU time, the run took 416 seconds; Brian Wichmann measured 417 seconds on an actual KDF9 in 1972. The real CPU time used by the emulator on my MacBook Pro was less than 1.7 seconds, so ee9 is about 250 times faster than the original KDF9 hardware.
Running the Whetstone Benchmark natively on the MacBook Pro takes less than 400 microseconds! This ratio of 1.7s (for an emulated KDF9 running an interpretive language) to 400µs (for native Mac compiled code) — about 4250 to 1 — reflects a slowdown by a factor of about 65 for Whetstone interpretation by the KDF9 and by a further factor of about 65 for KDF9 emulation on the Intel Mac. Since the Whetstone interpreter was very skilfully programmed in Usercode, this speaks well for the efficiency of ee9, which was written in a wide-spectrum language, and with pedagogical clarity as a more important objective than performance. It is also testament to the quality of GNAT, the magnificent GNU Ada compiler.
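The arithmetic can be cross-checked directly:

```python
# Cross-checking the slowdown arithmetic quoted above.
emulated = 1.7        # seconds: the benchmark interpreted by Walgol under ee9
native   = 400e-6     # seconds: the benchmark compiled natively on the Mac

ratio = emulated / native
print(round(ratio))            # 4250: the overall slowdown
print(round(ratio ** 0.5))     # 65: roughly the cost of each layer
```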
Here is the redacted log of an emulation run in hardware boot mode, showing the bootstrap of a Director, its initialisation dialogue with the operator, and its response to some operator commands. Because the Director is internally multi-threaded, messages from separate internal activities are interleaved in the monitor typewriter (Flexowriter) log output. This run shows a Director thread suspending itself, and resuming later, when the resource it needs (a paper tape punch, PTP) becomes available.
Text in blue is a running activity log from the emulator itself. Material in green italics is explanatory commentary that I have added. Text in bold red is KDF9 console Flexowriter output. A semicolon in such a message turns the write operation into a read operation (this is a hardware feature of the KDF9 Flexowriter buffer), so what precedes the semicolon is output and what follows the semicolon is the computer operator’s input, shown in bold black — the Flexowriter had another hardware feature, whereby the red part of its two-colour typewriter ribbon was used for output and the black part for input, the switch being made (strangely) before the semicolon. In this case, the computer operator inputs commands (“TINTs”) to Director. “|” represents the KDF9 End Message character, which was actually typed as “→” by the Friden Flexowriter.
Executing a cold boot, in fast mode (without diagnostics).
________________________________________________________________________
P — this Director was read from Paper Tape
KKT40E007UPU — Director announces itself
TIME SHARING DIRECTOR 2464 WDS| — Note the size of this multiprogramming system!
02U01 — Director reads and echoes the attached-device list:
02U02 — the first two digits code the device type,
05U03 — where 02=TR, 05=5-track TP, 01=8-track TP, 03=LP, 10=MT;
01U04 — 'U' means Unallotted;
03U05 — and the last two digits give the buffer number.
10U07 — Note that buffers 15, 16 and 17 are not initially configured.
10U10
10U11
10U12
10U13
10U14
CORE MODULES;8.| — 8 modules = 32K words
OUT 8 REEL NO;23001C.| — Tape Serial Number of first spool tape
A-PROGRAM DETAILS|
LEVELS;N.| — No A-program levels wanted
DATE D/M/Y;23/9/69.|
TIME ON 24-HOUR CLOCK HOURS/MINS;10/29.|
TINT;L16/10.| — configure buffer 16 as a magnetic tape
10L14 /Iden<KOUTPUT0>,TSN -00-2339 — MT5, Loaded with a tape, Serial Number -00-2339, label KOUTPUT0
10L13 /Iden<MS-DUMP>,TSN 77777777 — MT4, Loaded with a tape, Serial Number 77777777, label MS-DUMP
10L12 /Iden<EFPBEAAG>,TSN -00-0552 — MT3, Loaded with a tape, Serial Number -00-0552, label EFPBEAAG
10L11 /Iden<_Z_E_R_O>,TSN -00-1498 — MT2, Loaded with a tape, Serial Number -00-1498, a work tape
10L10 /Iden<_Z_E_R_O>,TSN -00-1488 — MT1, Loaded with a tape, Serial Number -00-1488, a work tape
10L07 /Iden<_Z_E_R_O>,TSN -00-1478 — MT0, Loaded with a tape, Serial Number -00-1478, a work tape
TINT;L15/10.| — configure buffer 15 as a magnetic tape
10L16 /Iden<EXAMTIME>,TSN -66-1909 — MT7, Loaded with a tape, Serial Number -66-1909, label EXAMTIME
10L15 /Iden<PRINTEND>,TSN -00-9299 — MT6, Loaded with a tape, Serial Number -00-9299, label PRINTEND
TINT;G.| — request output of device status list
02U01 — TR0, Unallotted paper tape reader, a.k.a. PTR
02U02 — TR1, Unallotted paper tape reader
05U03 — TP1, Unallotted 5-hole paper tape punch, a.k.a. PTP
01U04 — TP0, Unallotted 8-hole paper tape punch
03U05 — LP0, Unallotted, Line Printer
10L07 _Z_E_R_O — MT0, Loaded with a work (zero) magnetic tape
10L10 _Z_E_R_O — MT1, Loaded with a work (zero) magnetic tape
10L11 _Z_E_R_O — MT2, Loaded with a work (zero) magnetic tape
10L12 EFPBEAAG — MT3, Loaded with a labelled magnetic tape
10L13 MS-DUMP — MT4, Loaded with a labelled magnetic tape
10L14 KOUTPUT0 — MT5, Loaded with a labelled magnetic tape
10L15 PRINTEND — MT6, Loaded with a labelled magnetic tape
10L16 EXAMTIME — MT7, Loaded with a labelled magnetic tape
TINT;T0/10000.| — load a program from TR1 into level 0, with 10000 words of core
0 BPIU 02P — TR1 (on buffer 2) Allotted to load program level 0/P
0 CRNP F2 — Cannot Read New Program: F2 = “bad A-block”
02U02 — TR1 Unallotted
TINT;T1/6000.| — load a program from TR1 into level 1, with 6000 words of core
1 BPIU 02Q — TR1 (on buffer 2) Allotted to load program level 1/Q
1 CRNP F2 — Cannot Read New Program: again - there is still no tape in the reader!
02U02 — TR1 Unallotted
TINT;L3/1.| — change type of TP1 to 8-track
01U03
TINT;M0/20.| — output 16 words from 0 as syllables in octal to TP0
01A04D — TP0 (on buffer 4) Allotted to Director
01U04 — TP0 Unallotted
TINT;L4/0.| — take device on buffer 4 (TP0) out of service
TINT;M0/20.| — output 16 words from 0 as syllables in octal to TP0
TINT M HELD UP-NO PTP. — TINT M thread suspends itself: TP0 unavailable
TINT;L4/1.| — change type of buffer 4 back to PTP
01U04
01A04D — TINT M thread automatically resumes, TP0 Allotted to Director
01U04 — TINT M thread completes, TP0 Unallotted
^C — break-in to emulator’s debugger
ee9: Breakpoint (f:ast | t:race | p:ause or q:uit)? Q
Run stopped by user!
________________________________________________________________________
You might like to compare this emulation output with an actual Flexowriter log, preserved by David Holdsworth, of a session on the 29th of September 1969, running the later and more taciturn Eldon 2 Director:
Here is the output on TP0 from the “M0/20.|” commands:
TINT M OUTPUT
000 000 151 324 061 313 066 062 327 035 030 111 006 131 154 253 365 277 353 364 051 264 131 101 067 012 165 046 127 131 302 233 115 226 020 161 262 220 103 374 065 371 204 220 000 000 000 001 000 002 000 016 210 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 001 014 157 172 000 000 000 317 170 170 377 377 377 377 377 377
TINT M OUTPUT
000 000 151 324 061 313 066 062 327 035 030 111 006 131 154 253 365 277 353 364 051 264 131 101 067 012 165 046 127 131 302 233 115 226 020 161 262 220 103 374 065 371 204 220 000 000 000 001 000 002 000 016 210 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 001 014 157 172 000 000 000 377 377 377 377 377 377 377 377 377
KDF9 was one of the first computers to support Algol 60, the international algorithmic language that is the common ancestor of many of today’s most widely used programming languages. In fact, KDF9 was supplied with two Algol 60 compilers that ran under the Timesharing Director: the Kidsgrove (Kalgol) and Whetstone (Walgol) systems, which were named after the EE installations where they were written. The KDF9 Algol Manual describes the common features of the Walgol and Kalgol implementations.
Kalgol aimed at efficient machine code and included an optional global optimisation pass. The compiler had an unfortunate reputation for being slow and the optimiser was accused of generating unreliable object code. Its ambition for efficient machine code was not fully realised, among the problems being poor handling of the KDF9’s unique complement of registers. In particular, to avoid overfilling the nest, Kalgol took a very conservative approach to nest management, which nullified most of its advantages.
Walgol was a checkout compiler: aimed at fast compilation and comprehensively checked execution, it generated byte code for an abstract machine that bears a remarkable resemblance to the machine code of the (later) Burroughs B6500. Native to the paper tape oriented EE Timesharing system, Walgol was adapted to card input by John Patterson (JWP) et al., in the guise of Incore Algol, which served as a batch compiler for large numbers of short student jobs under EGDON. Transplanted to COTAN, Walgol provided a facility for online compilation and interactive execution.
Walgol was so well documented in ALGOL 60 Implementation, B. Randell and L.J. Russell, Academic Press; 1964 — a groundbreaking book (also in OCRed DjVu format) — that it engendered Algol 60 implementations on many other computers, including the EE DEUCE, the Ferranti Pegasus, the NPL ACE, the EE KDF6, the Soviet Minsk range, the EE System 4/50, the IBM System 360/25, the Philips PRS8000, the Elliott 903, and the Indian ECIL TDC–316. The Elliott 903 version has been used by Andrew Herbert to cross check the ee9 implementation of an example program from the KDF9 Algol Manual!
Until very recently, the influence of the Whetstone system on the Burroughs large-systems architecture had remained plausible, but conjectural. The B5500 had 12-bit instructions and no up-level addressing (so it could not fully implement Algol 60 non-locals and closures). Its successor, the B6500, had variable-length instructions consisting of 1 or more 8-bit syllables, and implemented a hardware display for addressing non-locals, just like the Whetstone virtual machine. Bill McKeeman has now given first-hand evidence that these similarities are no coincidence. He writes: ‘I ended up consulting for Burroughs Pasadena on the B6500 design. Ben Dent was struggling with generalising the B5500 stack to handle up-level addressing. So I told him how. During each weekly visit we discussed the problem until he asked “How do you know all of this?” So I said, it is in a book — ALGOL 60 Implementation. He was irritated at not knowing sooner, got the book, and did not need to talk to me about it again. So I think it is fair to say that whatever he needed, he got from the book.’
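Display-based addressing of non-locals can be sketched in Python (purely illustrative, neither Whetstone nor Burroughs code): a variable is named by a pair (lexical level, offset), and the display maps each level to its current activation record.

```python
# A sketch of display-based up-level addressing (illustrative only).
# display[level] holds the current activation record for that lexical
# level, so any non-local is reached in one indexed access.

display = []                     # display[level] -> activation record

def enter(level, nslots):
    """Enter a block at the given lexical level with nslots local slots."""
    frame = [None] * nslots
    del display[level:]          # drop deeper levels on (re)entry
    display.append(frame)
    return frame

def load(level, offset):         # fetch a (possibly non-local) variable
    return display[level][offset]

def store(level, offset, value):
    display[level][offset] = value

enter(0, 2); store(0, 0, "global x")
enter(1, 1); store(1, 0, "local y")
enter(2, 1)
print(load(0, 0))                # the innermost block reaches the outermost frame
```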
The EGDON operating system eventually gained a third Algol 60 compiler, written by the implementers of Kalgol with the benefit of hindsight. It eschewed global optimisation, but included a very good peephole optimiser, and combined respectable compilation speed with reasonably fast object code and decent run-time checking. Unlike Walgol and Kalgol, it generated relocatable object files that could be linked with the output of the EGTRAN (EGDON FORTRAN IV) and UCA3 (EGDON Usercode) compilers.
Here is one of the first full runs of Whetstone Algol in over 30 years. It is compiling and running one of the most complex examples from ALGOL 60 Implementation, namely Knuth’s GPS procedure. Two test cases are included, the second being multiplication of matrices, which it achieves by insanely devious use of name parameters and side effects (“Jensen’s Device”, pushed to the limit).
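The trick can be sketched in Python (a hedged approximation: explicit get/set thunks stand in for Algol call by name, and `Name`, `env` and the dict-based array are my own scaffolding, not anything from the KDF9 system):

```python
# Jensen's Device with an explicit model of call by name.

class Name:
    """A call-by-name parameter: its get/set thunks are re-evaluated
    on every access, as an Algol 60 thunk would be."""
    def __init__(self, get, set=None):
        self.get, self.set = get, set

def GPS(I, N, Z, V):
    # real procedure GPS (I, N, Z, V);
    #    for I := 1 step 1 until N do Z := V;  GPS := 1
    I.set(1)
    while I.get() <= N.get():
        Z.set(V.get())
        I.set(I.get() + 1)
    return 1

# GPS TEST CASE 1: fill a 5-by-3 array with A[i,j] = i+j in one statement.
m, n = 5, 3
A = {}
env = {"i": 0, "j": 0}
i_name = Name(lambda: env["i"], lambda v: env.__setitem__("i", v))
j_name = Name(lambda: env["j"], lambda v: env.__setitem__("j", v))
m_name = Name(lambda: m)
n_name = Name(lambda: n)
Aij    = Name(lambda: A[(env["i"], env["j"])],
              lambda v: A.__setitem__((env["i"], env["j"]), v))

# i := GPS(j, n, i, GPS(i, m, A[i,j], i+j));
env["i"] = GPS(j_name, n_name, i_name,
               Name(lambda: GPS(i_name, m_name, Aij,
                                Name(lambda: env["i"] + env["j"]))))

print(A[(2, 3)])   # 5, i.e. i+j
```

The inner call to GPS is an actual parameter passed by name, so it is re-executed every time the outer loop assigns Z := V, sweeping i over the whole column for each j.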
In this emulation run, operator input is requested by the Whetstone system, not the Director. The Translator asks for the “stream” number of a spooled output file. This is given either as N, to suppress the output, or as 30, the stream number for output to be spooled to the lineprinter (LP0). The Algol source program is read from a paper tape reader. The object code is kept in core, ready to be run by the Controller, the Whetstone abstract machine object code interpreter, which the compiler overlays on itself at the end of a successful compilation.
Running a problem program in fast mode (without diagnostics). — i.e. the Whetstone compiler
________________________________________________________________________
OUT;N.| — No compilation listing wanted
ee9: OUT 5: requests device of type 2; gets TR1. — the compiler reads the source program from TR1 on buffer #2
GPS|RAN/EL/000M02S/000M05S SIZE 247 — the compilation took 2s of CPU time and 5s elapsed
— the Whetstone virtual object code was 247 8-bit syllables long
ee9: OUT 8: stream #10; closed.
ee9: OUT 6: releases TR1.
ee9: OUT 1: ICR = 361787; RAN/EL = 2175985 / 10048149 KDF9 us. — so the housekeeping at the end of compilation took a further 5s elapsed
ee9: OUT 1: the Whetstone Controller is to follow the Translator.
ee9: Reattached TR0 to [Binary/KMW0301--UPU]
Running a problem program in fast mode (without diagnostics).
________________________________________________________________________
STREAM;30.| — stream spooled output to LP0
AD - S
ee9: OUT 8: stream #30; closed.
AD 30 CLOSED
RAN/EL/000M04S/000M05S — end of run of the Algol object program
ee9: OUT 1: ends the run of the Whetstone Controller.
________________________________________________________________________
End of Run.
...
Here is the program’s output on LP0. Note the transliteration by Whetstone’s output routines of “[” and “]” to “*(” and “*)” for output to the KDF9 lineprinter, which lacks square brackets:
GPS TEST CASE 1
A*(I,J*) SHOULD BE (I+J)
+2.00000 +3.00000 +4.00000
+3.00000 +4.00000 +5.00000
+4.00000 +5.00000 +6.00000
+5.00000 +6.00000 +7.00000
+6.00000 +7.00000 +8.00000

GPS TEST CASE 2
C*(I,J*) SHOULD BE 1.234I
+1.23400 +1.23400 +1.23400 +1.23400 +1.23400
+2.46800 +2.46800 +2.46800 +2.46800 +2.46800
+3.70200 +3.70200 +3.70200 +3.70200 +3.70200

RAN/EL/000M04S/000M05S
Here is the Algol 60 test program itself. The Flexowriter had a “non-escaping” underline character, which did not advance the typewriter carriage, so that it overstruck the following character. Underlining was used to distinguish an identifier such as “begin” from the unrelated Algol word symbol “begin”. For clarity, the Algol word symbols below have been stripped of their underlining, because “begin” shows up as “_b_e_g_i_n” these days. “_[” and “_]” are Algol 60 string quotes.
GPS|
begin
   real procedure GPS (I, N, Z, V); real I, N, Z, V;
   begin
      for I := 1 step 1 until N do Z := V;
      GPS := 1
   end GPS;

   procedure GPS TEST 1 (m, n); value m, n; real m, n;
   begin
      real i, j; array A[1:m, 1:n];
      writetext(30, _[GPS*TEST*CASE*1_[c_]_]);
      comment set A[i,j] to (i+j) ;
      i := GPS(j, n, i, GPS(i, m, A[i,j], i+j));
      writetext(30, _[A[i,j]*should*be*(i+j)_[c_]_]);
      for i := 1 step 1 until m do
      begin
         for j := 1 step 1 until n do
            write(30, format(_[+d.dddddss_]), A[i,j]);
         writetext(30, _[_[c_]_]);
      end;
   end GPS TEST 1;

   procedure GPS TEST 2 (m, n, p); value m, n, p; real m, n, p;
   begin
      real i, j, k; real q;
      array A[1:m, 1:n], B[1:n, 1:p], C[1:m, 1:p];
      writetext(30, _[_[c_]_[c_]GPS*TEST*CASE*2_[c_]_]);
      for i := 1 step 1 until m do
         for k := 1 step 1 until n do
            A[i,k] := i/10.0^(k-1);
      for k := 1 step 1 until n do
         for j := 1 step 1 until p do
            B[k,j] := k;
      for i := 1 step 1 until m do
         for j := 1 step 1 until p do
            C[i,j] := -999999999;
      comment set C to AB ;
      i := GPS(i, 1.0, C[1,1], 0.0)
         × GPS(i, (m-1) × GPS(j, (p-1) × GPS(k, n, C[i,j], C[i,j] + A[i,k]×B[k,j]),
                              C[i,j+1], 0.0 ),
                C[i+1,1], 0.0 );
      writetext(30, _[C[i,j]*should*be*1.234i_[c_]_]);
      for i := 1 step 1 until m do
      begin
         for j := 1 step 1 until p do
            write(30, format(_[+d.dddddss_]), C[i,j]);
         writetext(30, _[_[c_]_]);
      end;
   end procedure GPS TEST 2;

   GPS TEST 1(5, 3);
   GPS TEST 2(3, 4, 5);
end |
Here the emulator is running the Whetstone Benchmark. For the sake of interest, the actual CPU time used on my MacBook Pro laptop is shown at the end of the run. The emulation is more than 250 times faster than the real KDF9! Reading-in the Whetstone Translator and Controller programs from paper tape would have taken nearly 51 (real) KDF9 seconds. This time was not included in the Director’s (or ee9’s) measurement of the elapsed time, but was accounted to system overhead.
Running a problem program in fast mode (without diagnostics). — i.e. the Whetstone compiler
________________________________________________________________________
OUT;N.|
ee9: OUT 5: requests device of type 2; gets TR1. — the compiler reads the source program from TR1 on buffer #2
WHETSTONEBMK|RAN/EL/000M05S/000M15S SIZE 603 — the compilation took 5s of CPU time and 15s elapsed
— the Whetstone virtual object code was 603 8-bit syllables long
ee9: OUT 8: stream #10; closed.
ee9: OUT 6: releases TR1.
ee9: OUT 1: ICR = 813081; RAN/EL = 4816083 / 19741869 KDF9 us.
ee9: OUT 1: the Whetstone Controller is to follow the Translator.
ee9: Reattached TR0 to [Binary/KMW0301--UPU]
Running a problem program in fast mode (without diagnostics).
________________________________________________________________________
STREAM;30.|
AD - S
ee9: OUT 8: stream #30; closed.
AD 30 CLOSED
RAN/EL/006M56S/006M57S — end of run of the Algol object program
ee9: OUT 1: ends the run of the Whetstone Controller.
________________________________________________________________________
End of Run.
TR1 on buffer #2 transferred 5416 bytes.
TP0 on buffer #4 transferred 525 bytes.
FW0 on buffer #0 transferred 127 bytes.
LP0 on buffer #5 printed 13 lines.

real 0m1.686s
user 0m1.675s
sys 0m0.003s
Here is the program’s output on LP0. It has been verified by running the Whetstone benchmark, in Pascal as well as Algol 60, both natively on my MacBook Pro and in emulation on two ICL 1900 series models (with thanks to Bill Gallacher, Brian Spoor and David Holdsworth).
N=   0 J=   0 K=   0 X1=+1.00000000 X2=-1.00000000 X3=-1.00000000 X4=-1.00000000
N= 120 J= 140 K= 120 X1=-0.06834220 X2=-0.46263766 X3=-0.72971839 X4=-1.12397907
N= 140 J= 120 K= 120 X1=-0.05533645 X2=-0.44743656 X3=-0.71097339 X4=-1.10309806
N=3450 J=   1 K=   1 X1=+1.00000000 X2=-1.00000000 X3=-1.00000000 X4=-1.00000000
N=2100 J=   1 K=   2 X1=+6.00000000 X2=+6.00000000 X3=-0.71097339 X4=-1.10309806
N= 320 J=   1 K=   2 X1=+0.49040732 X2=+0.49040732 X3=+0.49039250 X4=+0.49039250
N=8990 J=   1 K=   2 X1=+1.00000000 X2=+1.00000000 X3=+0.99993750 X4=+0.99993750
N=6160 J=   1 K=   2 X1=+3.00000000 X2=+2.00000000 X3=+3.00000000 X4=-1.10309806
N=   0 J=   2 K=   3 X1=+1.00000000 X2=-1.00000000 X3=-1.00000000 X4=-1.00000000
N= 930 J=   2 K=   3 X1=+0.83466552 X2=+0.83466552 X3=+0.83466552 X4=+0.83466552
RAN/EL/006M56S/006M57S
WHETSTONEBMK|
begin
   real x1, x2, x3, x4, x, y, z, t, t1, t2;
   array e[1:4];
   integer i, j, k, l, n1, n2, n3, n4, n5, n6, n7, n8, n9, n10, n11;

   procedure pa(ep); real array ep;
   begin
      integer j;
      j := 0;
lab:  ep[1] := (ep[1] + ep[2] + ep[3] - ep[4]) × t;
      ep[2] := (ep[1] + ep[2] - ep[3] + ep[4]) × t;
      ep[3] := (ep[1] - ep[2] + ep[3] + ep[4]) × t;
      ep[4] := (-ep[1] + ep[2] + ep[3] + ep[4]) / t2;
      j := j + 1;
      if j < 6 then goto lab;
   end procedure pa;

   procedure p0;
   begin
      e[j] := e[k]; e[k] := e[l]; e[l] := e[j];
   end procedure p0;

   procedure p3(x, y, z); value x, y; real x, y, z;
   begin
      x := t × (x + y); y := t × (x + y); z := (x + y)/t2;
   end procedure p3;

   procedure pout(n, j, k, x1, x2, x3, x4);
      value n, j, k, x1, x2, x3, x4;
      integer n, j, k; real x1, x2, x3, x4;
   begin
      writetext(30, _[n=_]);  write(30, format(_[nnnd_]), n);
      writetext(30, _[*j=_]); write(30, format(_[nnnd_]), j);
      writetext(30, _[*k=_]); write(30, format(_[nnnd_]), k);
      writetext(30, _[*x1=_]); write(30, format(_[+d.dddddddd_]), x1);
      writetext(30, _[*x2=_]); write(30, format(_[+d.dddddddd_]), x2);
      writetext(30, _[*x3=_]); write(30, format(_[+d.dddddddd_]), x3);
      writetext(30, _[*x4=_]); write(30, format(_[+d.dddddddd_]), x4);
      writetext(30, _[_[c_]_]);
   end procedure pout;

   open(30);
   comment initialise constants;
   t := 0.499975; t1 := 0.50025; t2 := 2.0;
   comment read i, controlling total weight:
      if i=1 the total weight is one million Whetstone instructions:
      Walgol ran at 2.4KWIPS, so this should take 417 KDF9 CPU seconds;
   i := 1;
   i := i × 10;
   n1 := 0;       n2 := 12 × i;  n3 := 14 × i;  n4 := 345 × i;
   n5 := 0;       n6 := 210 × i; n7 := 32 × i;  n8 := 899 × i;
   n9 := 616 × i; n10 := 0;      n11 := 93 × i;

   comment module 1: simple identifiers;
   x1 := 1.0; x2 := x3 := x4 := -1.0;
   for i := 1 step 1 until n1 do
   begin
      x1 := (x1 + x2 + x3 - x4) × t;
      x2 := (x1 + x2 - x3 + x4) × t;
      x3 := (x1 - x2 + x3 + x4) × t;
      x4 := (-x1 + x2 + x3 + x4) × t;
   end module 1;
   pout(n1, n1, n1, x1, x2, x3, x4);

   comment module 2: array elements;
   e[1] := 1.0; e[2] := e[3] := e[4] := -1.0;
   for i := 1 step 1 until n2 do
   begin
      e[1] := (e[1] + e[2] + e[3] - e[4]) × t;
      e[2] := (e[1] + e[2] - e[3] + e[4]) × t;
      e[3] := (e[1] - e[2] + e[3] + e[4]) × t;
      e[4] := (-e[1] + e[2] + e[3] + e[4]) × t;
   end module 2;
   pout(n2, n3, n2, e[1], e[2], e[3], e[4]);

   comment module 3: array as parameter;
   for i := 1 step 1 until n3 do pa(e);
   pout(n3, n2, n2, e[1], e[2], e[3], e[4]);

   comment module 4: conditional jumps;
   j := 1;
   for i := 1 step 1 until n4 do
   begin
      if j = 1 then j := 2 else j := 3;
      if j > 2 then j := 0 else j := 1;
      if j < 1 then j := 1 else j := 0;
   end module 4;
   pout(n4, j, j, x1, x2, x3, x4);

   comment module 5: omitted;

   comment module 6: integer arithmetic;
   j := 1; k := 2; l := 3;
   for i := 1 step 1 until n6 do
   begin
      j := j × (k - j) × (l - k);
      k := l × k - (l - j) × k;
      l := (l - k) × (k + j);
      e[l - 1] := j + k + l;
      e[k - 1] := j × k × l;
   end module 6;
   pout(n6, j, k, e[1], e[2], e[3], e[4]);

   comment module 7: trig. functions;
   x := y := 0.5;
   for i := 1 step 1 until n7 do
   begin
      x := t × arctan(t2 × sin(x) × cos(x) / (cos(x + y) + cos(x - y) - 1.0));
      y := t × arctan(t2 × sin(y) × cos(y) / (cos(x + y) + cos(x - y) - 1.0));
   end module 7;
   pout(n7, j, k, x, x, y, y);

   comment module 8: procedure calls;
   x := y := z := 1.0;
   for i := 1 step 1 until n8 do p3(x, y, z);
   pout(n8, j, k, x, y, z, z);

   comment module 9: array references;
   j := 1; k := 2; l := 3;
   e[1] := 1.0; e[2] := 2.0; e[3] := 3.0;
   for i := 1 step 1 until n9 do p0;
   pout(n9, j, k, e[1], e[2], e[3], e[4]);

   comment module 10: integer arithmetic;
   j := 2; k := 3;
   for i := 1 step 1 until n10 do
   begin
      j := j + k; k := j + k; j := k - j; k := k - j - j;
   end module 10;
   pout(n10, j, k, x1, x2, x3, x4);

   comment module 11: standard functions;
   x := 0.75;
   for i := 1 step 1 until n11 do x := sqrt(exp(ln(x)/t1));
   pout(n11, j, k, x, x, x, x);
end |
At one point in the testing of ee9 I suspected that it was evaluating the Algol arctan function wrongly, so I looked carefully at the Usercode routine that implements it. Here it is:
P51V15; (arctan EL);
V0=F0.019042127887; V1=F0.019042129240; V2=F0.038082414120;
V3=F0.076666493927; V4=F0.121226383896; V5=F0.725940450930;
(V6-V10 used for b);
V11=Q6/1/0; V12=Q4/1/0;
(V13 used for a);
V14=F1.0; V15=F0.5;
   V14; =V13;
   DUP; DUP; ×F; V14; +F; JSP40; =V6;
   V12; =Q13;
2; V13; V6M13; +F; V15; ×F; =V13;
   V13; V6M13Q; ×F; JSP40; =V6M13;
   J2C13NZ;
   V11; =Q13;
   V0M13Q; ZERO; REV; FIX; FLOATD;
1; V0M13; V5M13Q; ×+F;
   J1C13NZ;
   ROUNDF; ÷F; EXIT1;

In case it is not obvious (!), the loop at label 2 uses the arithmetic-geometric mean (AGM) algorithm, described here, to develop an approximation. The Mathematica page suggests taking it to the limit, but that would be expensive, requiring the extraction of a square root in each of many iterations. Instead we have the rather mysterious loop at label 1, which uses the KDF9 double precision multiply-and-accumulate instruction (×+F) to form the scalar product of the AGM approximants and a vector of constants (held in V0-V5). This suggests a convergence-acceleration procedure, but it is certainly not one of the better-known methods, such as Aitken’s delta-squared process.
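For cross-checking, the computation can be sketched in Python. This is a sketch of the reading just given, not a transcription of the Usercode; the acceleration constants are written as the exact integer ratios from which the decimal literals above derive, and `math.atan` appears only to verify the result:

```python
import math

# The acceleration constants V0..V5, as exact integer ratios (the Usercode
# stores them as 48-bit floating-point approximations).
V = [n / 1479104550.0 for n in (28165298.0, 28165300.0, 56327872.0,
                                113397760.0, 179306496.0, 1073741824.0)]

def p51v15(x: float) -> float:
    """arctan(x) by 4 AGM cycles plus the convergence-acceleration step."""
    a, b = 1.0, [math.sqrt(x * x + 1.0)]
    for i in range(4):                    # the loop at label 2
        a = (a + b[i]) * 0.5
        b.append(math.sqrt(a * b[i]))
    # the loop at label 1: scalar product of the approximants with V
    s = V[0] + sum(V[i] * b[i - 1] for i in range(1, 6))
    return x / s
```

On arguments in (0, 1] this agrees with `math.atan` to within a few units in the last place of a 64-bit float.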
As it turned out, the problem I was trying to diagnose lay elsewhere, but I was left with the niggling dissatisfaction of not fully understanding how P51V15 worked. I then had the bright idea of enlisting the help of a former colleague, Michael Jamieson. He attacked the problem enthusiastically, and produced this paper, which explains everything.
I recently wondered how well the KDF9 algorithm of half a century ago compares with modern techniques, which presumably benefit from decades of advance in numerical analysis. So I wrote a test program, translating P51V15 from Usercode to Ada 2012, and raced it against the library function provided by the 2013 release of GNU Ada. I expected to find that today’s sophisticated technology would leave 1963’s method trailing in its dust.
To my astonishment, P51V15 was faster.
Digging down, I found that Ada.Numerics.Long_Elementary_Functions.Arctan does quite a bit of argument and result checking, but basically invokes the fpatan instruction of the x86_64 CPU.
The 1963 algorithm, implemented by software written in Ada 2012, is only 17% slower than that one instruction, implemented (I assume) by microcode! Even better, although P51V15 was designed for a machine with 48-bit floating point, it gets the same accuracy as the modern hardware method on a machine with 64-bit floating point.
The convergence acceleration process of P51V15, explained by Michael Jamieson in his paper, is remarkably effective. Without it, 24 AGM cycles, taking nearly 7 times longer overall, are needed to obtain the same accuracy as P51V15 actually gets from just 4 AGM cycles. It is also responsible for attaining 64-bit accuracy with an algorithm designed for 48 bits.
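That ratio is easy to check in software: running the same AGM recurrence to convergence, and taking the unaccelerated estimate arctan(x) ≈ x/b, needs roughly two dozen cycles to reach 64-bit accuracy. A minimal Python sketch, with `math.atan` as the reference and a generous iteration limit added purely as a safety guard:

```python
import math

def arctan_plain_agm(x: float) -> tuple[float, int]:
    """arctan(x) by iterating the AGM recurrence of P51V15 to convergence,
    with no acceleration step; returns the value and the cycle count."""
    a, b = 1.0, math.sqrt(x * x + 1.0)
    cycles = 0
    # the gap b - a shrinks by a factor of about 4 per cycle
    while b - a > 4e-16 * b and cycles < 60:
        a = (a + b) * 0.5
        b = math.sqrt(a * b)
        cycles += 1
    return x / b, cycles

value, cycles = arctan_plain_agm(1.0)
# value agrees with math.atan(1.0); cycles lands in the mid-twenties,
# against the 4 cycles that P51V15 needs with acceleration
```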
The difference in time between the Ada library Arctan and the fpatan function is due to the aforementioned argument and result checking. The P51V15 time does not include any of that overhead either: the programmers of 1963 perforce put the burden of such matters on the user, to a much greater extent than we rightly expect to bear nowadays; but they also had fewer peculiarities to contend with, there being no NaN or ±Inf values in the KDF9 floating point types! That said, my admiration for their work, already profound, has only been increased by these new results.
By the way, P51V15 is about 39,000 times faster in Ada on my laptop than it was in Usercode on the real KDF9!
Here is the test program, with the observed results included as commentary:
with Ada.Numerics.Long_Elementary_Functions;
with Ada.Text_IO;
with CPU_Timing;
with System.Machine_Code;

use Ada.Numerics.Long_Elementary_Functions;
use Ada.Text_IO;
use CPU_Timing;
use System.Machine_Code;

procedure arctan_test is

   -- typical output:
   -- R = 100000000 evaluations
   -- checksum = 5.00000005000000E+07, loop time per repetition = 6 ns
   -- checksum = 4.38824577044418E+07, P51V15 time per evaluation = 49 ns
   -- checksum = 4.38824577044418E+07, Arctan time per evaluation = 53 ns
   -- checksum = 4.38824577044418E+07, fpatan time per evaluation = 41 ns

   -- P51V15 is the KDF9 algorithm used in Whetstone Algol, ca. 1963
   -- See http://www.findlayw.plus.com/KDF9/Arctan%20Paper%20by%20MJJ.pdf

   type vector is array (0..5) of Long_Float;

   V : constant vector := (   28165298.0 / 1479104550.0,
                              28165300.0 / 1479104550.0,
                              56327872.0 / 1479104550.0,
                             113397760.0 / 1479104550.0,
                             179306496.0 / 1479104550.0,
                            1073741824.0 / 1479104550.0);

   function P51V15 (x : Long_Float) return Long_Float with Inline;

   function P51V15 (x : Long_Float) return Long_Float is
      A : Long_Float := 1.0;
      S : Long_Float := V(0);
      B : vector;
   begin
      B(0) := sqrt(x*x + 1.0);
      -- 4 AGM (Arithmetic-Geometric Mean) cycles give a set of values ...
      for i in 0..3 loop
         A := (A + B(i)) * 0.5;
         B(i+1) := sqrt(A * B(i));
      end loop;
      -- ... that is subjected to a convergence acceleration process:
      for i in 1..5 loop
         S := S + V(i)*B(i-1);
      end loop;
      return x / S;
   end P51V15;

   -- this is the hardware arctan function of the x86_64 Core i7 CPU
   function fpatan (X : Long_Float) return Long_Float with Inline;

   function fpatan (X : Long_Float) return Long_Float is
      Result : Long_Float;
   begin
      Asm(Template => "fld1" & Character'Val(10)  -- LF
                    & Character'Val(9)            -- HT
                    & "fpatan",
          Outputs  => Long_Float'Asm_Output ("=t", Result),
          Inputs   => Long_Float'Asm_Input  ("0", X));
      return Result;
   end fpatan;

   R : constant := 1e8;  -- number of loop repetitions

   function ns_per_rep (c : CPU_Usage_in_Microseconds) return Natural is
   begin
      return Natural(c * 1e3 / R);
   end ns_per_rep;

   x : Long_Float;
   c : CPU_Timer;
   l : CPU_Usage_in_Microseconds;
   t : CPU_Usage_in_Microseconds;

begin
   Put_Line("R =" & Integer'Image(R) & " evaluations");

   -- determine the fixed overhead time
   x := 0.0;
   Reset_Timer(c);
   for i in 1..R loop
      x := x + Long_Float(i)/Long_Float(R);
   end loop;
   l := User_CPU_Time_Since(c);
   Put_Line("checksum =" & Long_Float'Image(x) & ", " &
            "loop time per repetition =" & Natural'Image(ns_per_rep(l)) & " ns");

   x := 0.0;
   Reset_Timer(c);
   for i in 1..R loop
      x := x + P51V15(Long_Float(i)/Long_Float(R));
   end loop;
   t := User_CPU_Time_Since(c) - l;
   Put_Line("checksum =" & Long_Float'Image(x) & ", " &
            "P51V15 time per evaluation =" & Natural'Image(ns_per_rep(t)) & " ns");

   x := 0.0;
   Reset_Timer(c);
   for i in 1..R loop
      x := x + Arctan(Long_Float(i)/Long_Float(R));
   end loop;
   t := User_CPU_Time_Since(c) - l;
   Put_Line("checksum =" & Long_Float'Image(x) & ", " &
            "Arctan time per evaluation =" & Natural'Image(ns_per_rep(t)) & " ns");

   x := 0.0;
   Reset_Timer(c);
   for i in 1..R loop
      x := x + fpatan(Long_Float(i)/Long_Float(R));
   end loop;
   t := User_CPU_Time_Since(c) - l;
   Put_Line("checksum =" & Long_Float'Image(x) & ", " &
            "fpatan time per evaluation =" & Natural'Image(ns_per_rep(t)) & " ns");

end arctan_test;
The acceleration constants in the Usercode are given as floating-point literals computed from the integer numerators and denominators listed in Michael Jamieson’s paper. The Ada code does better by setting out the fractions explicitly; this allows the compiler to determine their values portably, to the full precision of the Long_Float type, whatever that may be on a particular computer, with no loss from converting a decimal approximation to binary.
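The size of that loss can be put in perspective with a quick Python check on the first constant, assuming (as the listings suggest) that the Usercode literal is the ratio correct to 12 significant figures:

```python
import math
from fractions import Fraction

exact = Fraction(28165298, 1479104550)     # the first acceleration constant
decimal_form = Fraction("0.019042127887")  # its 12-digit form in the Usercode

# The decimal literal is off by up to about 5e-13 ...
decimal_error = abs(exact - decimal_form)

# ... whereas converting the exact ratio directly to a 64-bit float loses
# at most half an ulp, which is under 2e-18 at this magnitude.
ulp = math.ulp(28165298 / 1479104550)
```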
Ada’s System.Machine_Code library package allows the inclusion of assembly code, and the Inline aspect in the function specifications tells the compiler to “macro-expand” the function bodies, if that would be beneficial. This makes little difference to the efficiency of P51V15, but preserves the full benefit of direct access to machine code in fpatan.
The Usercode has one apparent advantage over the Ada: it forms the scalar product in double-precision (96-bit) arithmetic, by dint of the ×+F instruction. ×+F develops a double-precision product and accumulates a double-precision sum that is rounded to 48 bits only when complete. Like most high-level languages, Ada has no built-in equivalent of this, although such a function could easily be programmed if it were supported by the hardware. Despite that disadvantage, the Ada version gets an accurate result by working entirely in 64 bits.
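One way to program such a function without hardware support is to carry the running sum in greater precision than the addends. As a hedged sketch of the idea (not of ×+F itself), Python’s `math.fsum` keeps the partial sums exact and rounds only once at the end, although, unlike ×+F, each individual product is still rounded before summation:

```python
import math

def dot_naive(v, b):
    """Scalar product with every product and partial sum rounded to 64 bits."""
    s = 0.0
    for x, y in zip(v, b):
        s += x * y
    return s

def dot_accumulated(v, b):
    """Scalar product with an exactly-tracked sum, rounded once at the end,
    which is the spirit of ×+F's double-length accumulation (the individual
    products are still rounded, unlike on the KDF9)."""
    return math.fsum(x * y for x, y in zip(v, b))

# A contrived case where the naive sum cancels catastrophically:
v = [1.0e16, 1.0, -1.0e16]
b = [1.0, 1.0, 1.0]
# dot_naive(v, b) loses the middle term entirely; dot_accumulated(v, b) keeps it
```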
One of the mathematical epics of the 20th century was the effort to discover and characterise all of the finite simple groups. As part of this struggle, it was useful to be able to find out whether a simple group of a specific order could exist. Counting the cosets of a subgroup was often a useful step in solving this problem.
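To make the idea concrete, here is a toy Python illustration of what a coset count delivers. It works by brute force over a small permutation group that is listed in full; the point of coset enumeration proper is to obtain the same count directly from a group presentation, without listing any elements:

```python
# Illustration only (not Todd-Coxeter): count the cosets of the subgroup
# H = <a> in the symmetric group S3, generated by a transposition a and
# a 3-cycle b, each represented as a tuple mapping index -> index.

def compose(p, q):
    """The permutation 'apply q, then p'."""
    return tuple(p[q[i]] for i in range(len(p)))

a = (1, 0, 2)  # the transposition swapping 0 and 1
b = (1, 2, 0)  # the 3-cycle 0 -> 1 -> 2 -> 0

# Close the group under the generators by breadth-first search.
group = {(0, 1, 2)}
frontier = [(0, 1, 2)]
while frontier:
    g = frontier.pop()
    for s in (a, b):
        h = compose(s, g)
        if h not in group:
            group.add(h)
            frontier.append(h)

H = {(0, 1, 2), a}  # the subgroup generated by a

# The right cosets Hg partition the group; count the distinct ones.
cosets = {frozenset(compose(h, g) for h in H) for g in group}
# len(group) is 6 and len(cosets) is 3: the index of H in S3
```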
John Leech, who was then in the Computing Laboratory at Glasgow University, wrote a coset enumeration program based on the Todd-Coxeter Algorithm, with refinements by Leech and by Brian Haselgrove. The HLT algorithm was programmed in KDF9 Usercode and ran on the Glasgow machine. But the Glasgow KDF9 had only 16KW of core store at the time, and this proved insufficient for some of the groups Leech wanted to investigate. Brian Wichmann, who had a doctorate in group theory, and who knew Leech, was working at the National Physical Laboratory (NPL). He was asked to run Leech’s program on the KDF9 at NPL, which had a 32KW store.
During this work, Leech discovered the 24-dimensional lattice that bears his name. The Leech lattice is closely related to several very large simple groups that turned out to be subquotients of the Monster, the largest of the sporadic finite simple (!) groups: a truly gigantic structure that represents the culmination of the group classification project.
Fortunately, Wichmann has preserved his correspondence with Leech, including program and data listings, and recently typed them up. So Leech’s program has now been run, probably for the first time since 1965, under ee9. Here is a log of an ee9 run, processing a test case that took 34 minutes of CPU time in 1965; it takes less than 7 seconds in emulation.
Running a problem program in fast mode (without diagnostics).
________________________________________________________________________

ee9: OUT 5: requests device of type 2; gets TR1.
ee9: OUT 5: requests device of type 1; gets TP0.
ee9: OUT 6: releases TP0.
ee9: OUT 0: end of run.
________________________________________________________________________

End of Run.
TR1 on buffer #2 transferred 118 bytes.
TP0 on buffer #4 transferred 125 bytes.

real    0m6.806s
user    0m6.791s
sys     0m0.006s
Here is the output. The program first copies its data from a tape reader (TR1) to a tape punch (TP0) and then, after a long calculation, it appends the number of cosets it has found (in this case, 1045).
AEFDBC                                — This line means AA=1, BE=1, CF=1, DD=1, EB=1, FC=1
|
A                                     — These lines ask for the cosets of A, B and C
B
C
|                                     — A finite presentation of the group follows ...
CCC                                   — ... i.e., CCC = 1, FACA = 1, and so on.
FACA
DFDC
DBDB
FBCEE
BBBBBBB
AEABAEAB
AEEAEABBB
ADADADADAD
ABCDEABCDEABCDEABCDEABCDEABCDEABCDE
|
1045|                                 — This is the number of cosets found.
Other valuable KDF9 websites include those operated by David Holdsworth and by Brian Wichmann at the Computer Conservation Society. In particular, Brian Randell had preserved a treasure trove of KDF9 engineering documents, which he has donated to the project; they have been scanned and put online by Brian Wichmann.
Several images of KDF9 are available from former KDF9-owning sites, including Leeds University and Birmingham University. The Flexowriter console is shown in this picture. It is just possible to see that it is logging a session run under EGDON. Eric Foxley has preserved these pictures.
If you have any corrections or additions to the KDF9 information in these pages, I would be grateful if you got in touch with me (Bill Findlay) by emailing to the address obtained by replacing “www.” in the URL of this web page by “kdf9@”.