Optimising EM87.OBJ

Hi all,

When Borland developed Turbo Pascal way back in the very early 1980s,
the 8087 was expensive, and so they came up with a 6-byte real, which
was fairly accurate, and could be handled quite efficiently by Intel
CPUs, using dx/bx/ax & di/si/cx as two 6-byte registers. They also sold
separate compilers that used BCD or the FPU. From version 4 the compiler
could handle both, but realising(?) that not everyone had a still pretty
expensive 8087/80287 in their PC, the code to use one of these chips was
put into four separate files. Of those files, only two, EI86 and EI87,
were provided in .ASM format. They are responsible for initialising the
software and hardware emulators - yes, even if you've got an FPU in your
system, there are still some functions that are not natively implemented
on it!

The other two, EM86 & EM87, are only supplied in .OBJ format. Of these
two, EM87 is the more interesting. It can be pushed through OBJ2ASM, and
the result is quite revealing: it contains a shitload of code that is
never ever used by any of Borland's Pascal compilers. The original TP6
EM87 file contains 1047 bytes, the BP7 version two more due to an added
FSTP ST(0) instruction. (NB: If you are using the superb replacement
libraries for TP6 (TPL60N19.ZIP) or BP7 (BPL70N16.ZIP) from Norbert
Juffa (available on Garbo), then you should be aware that they are not
using the standard EM87.OBJ file. Norbert patched the file to insert
NOPs before FPU instructions if running on a CPU > 8086.)

The emulation itself is done through interrupts 34h to 3Dh, but Borland
also uses interrupt 3Eh to implement a number of shortcut calls. They
are encoded as the two bytes following the "int 3Eh" instruction, and
they are, straight from RB61 (Ralf Brown's Interrupt List, release 61):

> Notes: the two bytes following the INT 3E instruction are the subcode
>          (see #03195) and a NOP (90h), except for subcodes DCh and
>          DEh, where the second byte is a register count (01h-08h)
>        this vector is modified but not restored by Direct Access v4.0,
>          and may be left dangling by other programs written with the
>          same version of compiled BASIC
> SeeAlso: INT 3D

> (Table 03195)
> Values for Borland floating-point shortcut subcode:
> Subcode         Function
>  DCh    load 8086 stack with 8087 registers; overwrites the 10*N bytes
>           at the top of the stack prior to the INT 3E with the 8087
>           register contents
>  DEh    load 8087 registers from top of 8086 stack; ST0 is furthest
>           top of 8086 stack
>  E0h    round TOS and R1 to single precision, compare, pop twice
>           returns AX=8087 status word, FLAGS=8087 condition bits
>  E2h    round TOS and R1 to double precision, compare, pop twice
>           returns AX=8087 status word, FLAGS=8087 condition bits
>         Note: buggy in TPas5.5, because it sets the 8087 precision
>           control field to the undocumented value 01h; this results in
>           actually rounding to single precision
>  E4h    compare TOS/R1 with two POP's
>           returns FLAGS=8087 condition bits
>  E6h    compare TOS/R1 with POP
>           returns FLAGS=8087 condition bits
>  E8h    FTST (check TOS value)
>           returns FLAGS=8087 condition bits
>  EAh    FXAM (check TOS value)
>           returns AX=8087 status word
>  ECh    sine(ST0)
>  EEh    cosine(ST0)
>  F0h    tangent(ST0)
>  F2h    arctangent(ST0)
>  F4h    ST0 = ln(ST0)
>  F6h    ST0 = log2(ST0)
>  F8h    ST0 = log10(ST0)
>  FAh    ST0 = e**ST0
>  FCh    ST0 = 2**ST0
>  FEh    ST0 = 10**ST0
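
So a call to, say, the sine shortcut should look something like this in
the compiled code (my reading of the notes above, not a disassembly):

    int     3Eh        ; Borland shortcut interrupt
    db      0ECh       ; subcode: sine(ST0)
    db      90h        ; filler byte, a NOP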

Now, if you look at the OBJ2ASM created source for EM87, you will see
that there is one very obvious optimisation you can make. At offset
0402, you will find an unconditional jump, followed by "fcom st(1)" and
another unconditional jump. There is no path to the "fcom st(1)", so it
can be zapped, saving 5 bytes. (Wow ;) )

Of course we are not really satisfied with a saving of 5 bytes, so let's
see what more can be done. This requires the RTL source, but if you do
not have access to it, don't worry, I've already gone over it, and it
turns out that all code referring to subcodes DCh..EAh can be safely
zapped, saving 139 bytes, which is beginning to look more promising...

Of course removal of this code also means that the jumptable used to
access them can be shortened, saving an additional 16 bytes, for a total
of 145 bytes.

The next thing we can look at are the remaining 10 shortcuts. It turns
out, only five of them are used by the RTL (the other 5 can be used with
the MATH unit included in Norbert Juffa's TPL60N19.ZIP and BPL70N16.ZIP
replacement run-time libraries). The five that are really(?) required
are ECh/EEh/F2h/F4h/FAh. Sadly they are not consecutive, but it's not a
problem to point the jump-table entries of the five others to a RETN
instruction. Removing the code associated with them saves another 27
bytes, bringing our total savings to 172 bytes.
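
Roughly, the trimmed jump table would then look like the sketch below.
All label names are made up by me (OBJ2ASM only gives you generated
ones), and I'm assuming 2-byte near offsets, which fits the 16 bytes
saved for the 8 entries removed earlier:

    ShortcutTable label word
        dw  offset DoSin       ; ECh  sine       (used by the RTL)
        dw  offset DoCos       ; EEh  cosine     (used by the RTL)
        dw  offset DoNothing   ; F0h  tangent
        dw  offset DoArctan    ; F2h  arctangent (used by the RTL)
        dw  offset DoLn        ; F4h  ln         (used by the RTL)
        dw  offset DoNothing   ; F6h  log2
        dw  offset DoNothing   ; F8h  log10
        dw  offset DoExp       ; FAh  e**x       (used by the RTL)
        dw  offset DoNothing   ; FCh  2**x
        dw  offset DoNothing   ; FEh  10**x

    DoNothing:
        retn                   ; shared stub for the five unused subcodes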

Is this all we can do? Of course it isn't. The 8087 is dead as a dodo,
or at least as a Tasmanian wolf, so we don't need FWAIT instructions
before every FPU instruction. Add .286/.287 (or .386/.387) to the file
and the assembler no longer emits them, saving another 100-odd bytes.
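
To see what the directive does, assemble the same instruction under both
settings. This is a sketch; the opcode bytes in the comments are what I
would expect TASM or MASM to emit:

    .8087
    fld1               ; 9B D9 E8: FWAIT + FLD1

    .286
    .287
    fld1               ; D9 E8: no FWAIT in front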

Are we finished? Guess?

No, we can do more, but this requires modifications to another RTL file,
F87H, and we also need to add some code to MAIN to abort upon detection
of anything below the 386...

So what do we need to do? I'll leave the modifications to MAIN up to
you, but in F87H, the calls to shortcut codes ECh, EEh and F2h (i.e.
Sin, Cos and Arctan) can be replaced directly by FSIN, FCOS and FPATAN,
although the first two will need a tiny bit of additional code to detect
invalid arguments. The code below comes from Norbert Juffa's TPL60N19:

    wait               ; check for unmasked exceptions
    fnstsw  ax         ; get FPU condition codes
    sahf               ; transfer condition codes to CPU flags
    jnp     @ok        ; argument in range, done
    mov     ax, 207    ; error code 207, illegal float operation
    jmp     HaltError  ; exit thru error handler
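
For the call sites themselves, the replacements might look like this.
It is only a sketch: I'm assuming the argument sits alone in ST0 on
entry, as the shortcut table suggests, and the "was:" comments show the
shortcut encoding rather than the actual F87H source.

    ; was: int 3Eh / db 0ECh / nop
    fsin               ; ST0 = sin(ST0), sets C2 if |ST0| >= 2^63
                       ; the range check shown above goes here

    ; was: int 3Eh / db 0F2h / nop
    fld1               ; ST0 = 1.0, ST1 = x
    fpatan             ; ST0 = arctan(ST1/ST0) = arctan(x), pops once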

Having done this, we can now remove the code for these functions, and
four more entries from the jump-table. The final (oh no!) result is a
new EM87.OBJ with a size of a mere 410 bytes, a saving of 637 bytes,
albeit that we will lose some of these savings to the above FPU result
check code and the "not-a-386(+)-then-boom!" processing.
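
For the "not-a-386" part, the well-known flag-bits test would do in real
mode. A sketch, with NotA386 as a made-up label for wherever you want to
bail out:

    pushf
    pop     bx              ; BX = original flags
    mov     ax, bx
    and     ax, 0FFFh       ; try to clear bits 12-15
    push    ax
    popf
    pushf
    pop     ax
    and     ax, 0F000h
    cmp     ax, 0F000h      ; 8086/8088: bits 12-15 read back as ones
    je      NotA386
    mov     ax, bx
    or      ax, 0F000h      ; try to set bits 12-15
    push    ax
    popf
    pushf
    pop     ax              ; AX = flags as the CPU kept them
    push    bx
    popf                    ; put the original flags back
    test    ax, 0F000h      ; 80286: bits 12-15 read back as zeros
    jz      NotA386
                            ; still here: 386 or better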

As for the "oh no!"...

In his unsurpassed SPO.EXE, from SPO120.ZIP (on Garbo), Morten Welinder
uses "fldln2/fxch/fyl2x" to replace the Ln shortcut call. In itself this
is OK: the original code, stripped down to its bare essentials, boils
down to almost the same code. Almost? The original tests whether the
input value is less than 1.0000305... and, if so, uses the more accurate
fyl2xp1 function instead. I do not know if this is required. A quick
test did not show any differences between the two, over the full word
range of the two lsb of an extended number...
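
For reference, here is the short sequence with the stack contents
spelled out, plus my guess at what a near-1 path using fyl2xp1 would
look like. The second half is a sketch and not the EM87 code; fyl2xp1
wants x-1 rather than x, and only works for small arguments.

    ; plain version, x in ST0
    fldln2             ; ST0 = ln(2),  ST1 = x
    fxch               ; ST0 = x,      ST1 = ln(2)
    fyl2x              ; ST0 = ln(2) * log2(x) = ln(x)

    ; near-1 version, x in ST0
    fld1               ; ST0 = 1.0,    ST1 = x
    fsubp   st(1), st  ; ST0 = x - 1
    fldln2             ; ST0 = ln(2),  ST1 = x - 1
    fxch               ; ST0 = x - 1,  ST1 = ln(2)
    fyl2xp1            ; ST0 = ln(2) * log2((x - 1) + 1) = ln(x)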

Finally, what does the new EM87.ASM look like? Sadly, these files are
too long to post. The files I've got available are:

EM87.ASM - standard replacement, full functionality (20k)
EM87.V2  - zap all unused code only (14k)
EM87.V3  - needs mods to F87H to implement several functions (9k)

All files contain, as a comment, optimised Norbert Juffa code to insert
NOPs rather than FWAITs on anything over an 8086.

Private email to get a copy - since last week I've got internet access
at home, so if Timo is willing to accept them, they might appear on
Garbo sometime in the future.

Robert AH Prins

PS: Yes, you can also get EM86.ASM, it's a monster at 122k, and only
marginally commented. It's buggy and probably not very useful.