[Lowfer] AFRICAM 4.4 available

Tue, 17 Feb 2004 09:28:48 -0500

I've been going around in circles with this "Undouble error" in
AFRICAM.  Last night I ran the program on a different computer
and was somewhat surprised to have a whole different batch of
symptoms show up.  More so than I thought, the performance of
AFRICAM depends on the system on which it runs.  Here's what I
found:

1.  Do not even *think* about running this program with any kind
of Windows operating system.  It may *appear* to work, but the
bit timing will be pretty horrible, especially at faster speeds.

2.  I switched from my usual Compaq ARMADA laptop to another Compaq,
an e-Bay "cheapie" -  an LTE5300 model.  Now the LTE5300 is *faster*
than the ARMADA (133 MHz vs 100 MHz Pentium) and the response time
on-screen is noticeably faster, but the LTE5300 was giving me my
first-observed "Cache overrun, hit any key to shut down gracefully"
error messages - every now and then.  This is a pretty serious error
message - a holdover from the old days when we had 386's which weren't
always fast enough to keep up with the demands of processing all the
incoming data at MS100.  The "cache" in AFRICAM holds units of
"half-digested" audio sample data in blocks of 1 bit-time duration.
The initial processing is done in machine language as part of the
interrupt service routine, then the half-digested information is
stacked up for AFRICAM's background processes to deal with.  Normally
on a P100 class machine the cache is almost always right up to date.
If you hit the HOME key you will get some diagnostic information at
the bottom of the screen.  It gives the current cache depth, and the max
depth at any point since you last looked.  The maximum depth of
the cache is 128 entries.  That's good for 12.8 seconds worth of
received data at MS100.  That's *received* data, has nothing to
do with transmission.  I found that these cache overruns happened
because the hard drive on the LTE5300 takes a long time to spin
up to speed after it has shut down to save power.  The one on the
ARMADA takes about 1.7 seconds (or so it seems, in practice it may
be 1.7 seconds only until the controller on the drive is ready to
accept data into its buffer, and if we try to write too much data
there will be a much longer delay).  AFRICAM does not read from
disk while it is running, it only writes periodically, as when
TRACE is on or the LOGCAL (diagnostic) is on.  There are some
other ways it can write a *lot* of data to disk, but that is not
a problem because then the drive motor never turns off.  The
time to complete a disk write after the drive has spun down is
probably determined by the drive itself, not the operating system
or the BIOS.
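
For illustration only, the cache arithmetic in item 2 can be sketched in
Python.  This is not AFRICAM's actual code.  The names CACHE_ENTRIES,
cache_seconds and BitCache are hypothetical, and the sketch assumes that
an "MSn" mode means n milliseconds per bit (which matches 12.8 seconds
for 128 entries at MS100).

```python
from collections import deque

CACHE_ENTRIES = 128  # maximum cache depth quoted in the text


def cache_seconds(ms_per_bit, entries=CACHE_ENTRIES):
    """Seconds of received data the cache can hold (one entry = one bit time)."""
    return entries * ms_per_bit / 1000.0


class BitCache:
    """Bounded FIFO: the interrupt side pushes, the background side pops."""

    def __init__(self, capacity=CACHE_ENTRIES):
        self.capacity = capacity
        self.blocks = deque()
        self.max_depth = 0  # high-water mark, similar to the HOME-key display

    def push(self, block):
        # Called once per bit time with a "half-digested" block.
        if len(self.blocks) >= self.capacity:
            raise OverflowError("Cache overrun")
        self.blocks.append(block)
        self.max_depth = max(self.max_depth, len(self.blocks))

    def pop(self):
        # Called by the background process when it has time to catch up.
        return self.blocks.popleft()
```

Under the same assumption, cache_seconds(25) gives 3.2, so a background
stall (such as a disk spin-up) lasting longer than about 3.2 seconds at
MS25 would overrun a 128-entry cache.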

3.  I was transmitting a test message (2IQ<sp>) and receiving my own
transmission at the same time.  Running as MS25 ET1, run length 4, no
grab.  A whole bunch of "2IQ 2IQ ".... was scrolling nicely down the
screen.  After about 30 minutes I glanced over at the screen and noticed
it was copying groups of 4 characters (same characters each time) but those
characters were not 2IQ<sp>!  Something had "slipped" even though I was
using GPS-disciplined sync.  I turned off AUTOTRACK and SYNC and manually
stepped up one start bit.  The correct data reappeared.  Somehow, AFRICAM
had managed to get out of sync with itself!  Same program transmitting and
receiving under GPS control!  The UTC clock at top of screen was now
1 second slow compared to WWV.  Obviously the system had missed a 1-pps
interrupt at some point.  Actually, it had missed the opportunity to
process a 1-pps interrupt in a timely manner.  There is no "cache" for
1-pps interrupts.  When the interrupt occurs the program absolutely *must*
get around to processing it before the next 1-second interrupt comes along
or else there will be serious problems.   Normally these 1-pps interrupts
are processed within a millisecond.  So what happened?  I broke down and
installed a proper RS232 driver for the 1-PPS line, thinking maybe that
was the culprit.

4.  I started watching the calibration numbers rolling down the screen.
The system does a calibration of the Timer0 machine clock, and the
sound card sampling rate every minute.  The calibration data is gathered
over the minute, and the results are reported at the 59-second mark of
each minute.  Every now and then there was an anomalous calibration.  The
T0 frequency or the sound card sampling rate (sometimes both in same minute)
would jump from its nominal value!  This is consistent with missing an
interrupt, double-clocking an interrupt, or having an interrupt delayed at
the hardware level (e.g. by someone else turning off the system interrupts
for a relatively long time) so that the program doesn't see the 1-PPS
interrupt right away.  It's equivalent to adding a *lot* of extra jitter on
the pulse.

5.  Every now and then I'd notice there was an error in the received copy.
Now I was receiving my own local transmission, so the SNR was superb - plenty
of strong signal to work with.  Turns out the sending routine was being
held up by something (for quite a few bit times at MS25), but then once
AFRICAM got the machine back it would send all the missed bits real fast to
get caught up to real-time.  Then everything seemed to be OK again.

6.  Eventually I got one of those "Undouble" errors.

All of the above are symptoms you'd expect from running under Windows,
where the OS time-slices.  AFRICAM in its present incarnation can't live
with time-slicing.  But I wasn't using Windows.   I went into the BIOS and
tried to customize the "Energy Conservation" settings - that helped but did
not completely eliminate the problems.  So I disabled all conservation
measures at the BIOS level.  That cured all the problems.  The program has
been running (admittedly at MS1000) since before midnight last night
steadily transmitting (and receiving) "2IQ " without missing a beat.
Unfortunately the screen lamp never shuts off and the hard drive spins all
the time, but c'est la vie.

There are some changes in 4.4 related to the "undouble" error problem.
Please note I still got an undouble error even with Rev 4.4, so the cure
isn't completely in yet.  I'm becoming more and more convinced that this is
caused by hardware issues and not software (programmer's famous last
words!).  AFRICAM does not use the 1-pps interrupts directly for either
send or receive timing.  It uses them only to calibrate the internal
machine clock (for sending) and the sound card sampling rate (for
receiving).  This is necessary because we have to support bit rates up to
MS5 (200 baud), so we have to divide the 1-pps time pulses up into smaller
chunks of time but still maintain GPS accuracy.  The method used was to
accurately calibrate the internal CPU clock (using GPS), then use *that* to
control the bit timing, at least until the next 1-pps interrupt comes
along.  That is why, for example, you can pull the 1-pps plug altogether
while it is sending and it will continue sending almost perfectly.  It uses
the internal oscillator.  The UTC clock at top of screen won't update
though, because that only updates when the 1-pps interrupts occur.  If we
miss one, the clock will run 1 second slow forever after.  The NMEA
information is used only at the very beginning of a session to set the DOS
clock accurately.  Thereafter AFRICAM relies on the 1-pps ticks.

There is a parameter called TIXPS (ticks per second) - which refers to the
number of clocks to the T0 timer in each second.  This is a calibrated
value, measured over the previous minute.  AFRICAM uses it when calculating
how many ticks to wait until the next bit goes out when sending.  If the
value of TIXPS gets corrupted, that could easily cause an "Undouble" error.
For general info, that's an error message from a procedure called
"undouble" which converts a real (floating point) number into a 32-bit
integer.  The message tells us that somebody passed it a parameter whose
absolute value was greater than 2,147,483,647 - and these values are always
*relative* to a preceding reference point in time, which hopefully moves
ahead each second on the 1-pps interrupt.  The nominal T0 tick rate is
1,193,180 per second, and the longest we'd expect to wait for a next bit
out would be a couple of minutes, and that's an extreme case which only
happens at the beginning of a transmission at a very slow bit rate, ET3,
when you ask it to start sending and it has just missed a UTC time slot
opportunity and has to wait for a whole frame time to come around again.
Even so, we have enough headroom in the register length for 1800 seconds of
delay, so that should cover even the worst case.  In other words, those
"undouble" errors just should not occur, period.  They *will* occur if 1)
we lose GPS interrupts completely for half an hour or so, 2) the TIXPS
calibration value is grossly wrong so the program overestimates the number
of ticks it needs to cover a relatively short time span, or 3) there is
still some bug in the program code.
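
The headroom arithmetic above can be checked with a short Python sketch.
This is only an illustration under stated assumptions, not the real
"undouble" procedure (the original's language and rounding behaviour are
not given here).  The helper name ticks_for_delay is hypothetical.

```python
INT32_MAX = 2_147_483_647   # largest value a signed 32-bit integer can hold
NOMINAL_TIXPS = 1_193_180   # nominal T0 ticks per second quoted in the text


def undouble(x):
    """Convert a real number to a 32-bit integer, refusing out-of-range input."""
    if abs(x) > INT32_MAX:
        raise OverflowError("Undouble error: %r does not fit in 32 bits" % (x,))
    return int(round(x))  # rounding choice is an assumption


def ticks_for_delay(seconds, tixps=NOMINAL_TIXPS):
    """T0 ticks to wait for a delay measured from the last reference point."""
    return undouble(seconds * tixps)


# Longest delay that still fits: a little under 1800 s (about 1799.8 s).
headroom_seconds = INT32_MAX / NOMINAL_TIXPS
```

With the nominal rate, a two-minute wait needs 143,181,600 ticks, well
inside the limit.  If TIXPS were corrupted to, say, 100 times its nominal
value, the same two-minute wait would come to about 14.3 billion ticks
and trip the error, which is case 2) above.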

I put a "reasonableness" check on the TIXPS calibration value - so the
program keeps the old value if the new value is too far off what it has
been seeing.
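
A check of this kind might look something like the Python sketch below.
The function name accept_tixps and the tolerance of 100 parts per million
are assumptions made for illustration.  The threshold actually used in
Rev 4.4 is not stated here.

```python
def accept_tixps(old_tixps, new_tixps, max_rel_error=1e-4):
    """Return the new calibration if it is plausible, else keep the old one.

    max_rel_error is a hypothetical tolerance (100 ppm), not AFRICAM's value.
    """
    if abs(new_tixps - old_tixps) <= max_rel_error * old_tixps:
        return new_tixps
    return old_tixps
```

For example, accept_tixps(1193180.0, 1193182.5) accepts the new value,
while accept_tixps(1193180.0, 2386360.0) (the sort of result a missed
1-pps interrupt could produce) keeps the old one.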

We *could* make another cache for processing GPS interrupts, but heck, this
system is already way too complicated as it is.  Let's try to make sure the
program always gets enough CPU cycles to do whatever it needs to do in
real-time.  That means go into the BIOS and turn off everything that could
steal machine cycles.  Don't run anything in the background (kill all those
TSR's!).  And let's try to keep our hardware installation clean - free from
RFI-induced extra interrupts or missed interrupts.  And I think AFRICAM
will run for days at a time without problems.

Oh yes, one other thing...  The LTE5300 has an integrated sound system, but
it still uses the ESS688 chip, same as the one used in my ARMADA's PCMCIA
sound card.  But whereas the ARMADA is right on 8200 samples per second,
and stable as a rock, the LTE5300 is way too high, at 8200.49 something,
and is much less stable.  I've looked on the motherboard and can't find any
crystal or oscillator module near the ESS688 LSI chip, so it seems they are
deriving the sampling rate clock from something available on the bus.  The
main oscillator seems to be a module at 66.666666 MHz (presumably doubled
to get the 133 MHz Pentium clock frequency).  Does anyone know how they
derive either the T0 clock or the sound chip clock from that freq?  It is
not a simple divide by n sort of thing.
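
As a quick arithmetic check on the "not a simple divide by n" remark, the
Python sketch below divides the oscillator frequency by the two nominal
rates quoted in this message.  It only tests direct integer division of
those figures and says nothing about where the clocks really come from.
The helper name divider is hypothetical.

```python
OSC_HZ = 66_666_666     # main oscillator module, as read off the board
T0_HZ = 1_193_180       # nominal T0 tick rate
SAMPLE_HZ = 8200        # nominal sound card sampling rate


def divider(source_hz, target_hz):
    """Return (quotient, remainder) of dividing source_hz by target_hz."""
    return divmod(source_hz, target_hz)


t0_div = divider(OSC_HZ, T0_HZ)          # nonzero remainder: no integer n
sample_div = divider(OSC_HZ, SAMPLE_HZ)  # nonzero remainder: no integer n
```

Neither division comes out even, so no whole-number n turns 66.666666 MHz
directly into either rate.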

Fingers crossed,
Bill

Version 4.4 is available from my web site:

www.magma.ca/~ve2iq