ELEC 350 EZ-KIT LITE TUTORIAL Rev.1.1 - Raymond Li - Sep.96
===========================================================

Introduction
============

The purpose of this tutorial is to supplement the documentation supplied
with the EZ-KIT Lite. It is meant to give you a fundamental understanding
of the ADSP-2181 and AD1847 codec combination in order to quickly implement
DSP algorithms.

The reader should have at least a copy of the EZ-KIT Lite Reference Manual.
Optional in-depth documentation includes:

Data Sheets Packaged with EZ-KIT Lite:
   ADSP-2100 Family DSP Microcomputers ADSP-21xx
   DSP Microcomputer ADSP-2181
   Serial-Port 16-Bit SoundPort Stereo Codec AD1847

PDF (Portable Document Format) Documentation from Analog Devices Web Site
(www.analog.com):
   ADSP 2100 Family User's Manual
   DSP Applications Using the ADSP-2100 Family, Vol. 1
   DSP Applications Using the ADSP-2100 Family, Vol. 2

Rutgers University DSP Course Lab Manual by [prof name]
   [url]

Bound Documentation from Analog Devices
   ADSP-2100 Family Assembler Tools & Simulator Manual
   ADSP-2181 User's Manual

Template and Batch Files by Author
   ez_shell.dsp : template for DSP coding
   ez_init.dsp  : initialization module
   ez_core.dsp  : DSP algorithm module
   ez_end.dsp   : wrapup module
   eza.bat      : quick assembly and link
   ezl.bat      : quick upload to kit
   ezs.bat      : quick simulation

I recommend that you complete these preliminaries before attempting to
program the DSP:

 1. Read the EZ-KIT Lite Reference Manual
    Chp1 : o all
    Chp2 : o skim
    Chp3 : o note EZ-Kit Lite board components
    Chp4 : o note subdirectories containing sample DSP programs
           o note JP2 is used to select MIC or LINE level input
    Chp5 : o note the design procedure
           o gives only brief assembly language introduction (this tutorial
             addresses that need w/o reading the 400+ page User's Manual)
           o note sample code listing (compare with ez_shell.dsp by author)
           o note assembler, linker, and simulator calling (see batch files
             by author)
    Chp6 : o try each demonstration
           o note that the simulator does not run in a DOS box under Win 3.x
             or Win95; simpler to use DOS loader (i.e. ezfast.com)
    Chp7 : o note program and data memory restrictions
    Chp8 : o all
    PQR  : o note these sections
             Development Software Invocation Commands
             Instruction Set Summary
                ALU
                MAC
                Shifter
                Data Move
                Program Flow
             Control/Status Registers
             Memory Maps
             Interrupt Vector Tables

 2. Go through this tutorial

 3. Skim through sample code listings
    o demo programs installed with the EZ-KIT Lite software
    o ez_shell.dsp by author


The tutorial is divided into 3 sections: [to complete]

Section A: Description of DSP

   A.1 Registers
   A.2 Computational Units
   A.3 Numeric Format
   A.4 Memory
   A.5 Program Control
   A.6 Data Transfer
   A.7 Multifunction Instructions

Section B: Programming DSP

Section C: Notes and Hints


In order to program the DSP you'll need to be comfortable with assembly
language. In other words you should be familiar with these concepts:

   o binary and hexadecimal to decimal conversion
   o instruction set (bit, logic, and arithmetic functions)
   o memory addressing modes
   o data & program memory management
   o registers (control, data, and status)
   o program counter
   o stack, stack pointer
   o interrupts, interrupt vector table, interrupt service routines
   o accumulator

First, you'll be introduced to the assembly language characteristics
particular to the ADSP-2181. You'll quickly notice that its syntax and
architecture are quite different from 68k, HC11, 80x86, and other
microprocessors/controllers. The ADSP-2181 syntax is algebraic-like and,
with references at hand, is not too hard to understand.

Next, many simple example code fragments will be given to illustrate the
syntax of the most common instructions. After a few examples, you'll see
the pattern of the instruction set and be able to sequence them into
your algorithm. Specifically, code syntax comparisons will be made
between ADSP-2181 and MATLAB. MATLAB's scripting language is high-level
enough for anyone with some programming experience to understand. As
well, MATLAB is perfectly suitable for implementing DSP algorithms and
will in fact be an aid when debugging your algorithm.

The code syntax comparisons will include:
   o variable declaration (and restrictions)
   o data transfer & assignment
   o arithmetic
   o flow control statements
   o loops

Hopefully this tutorial will be sufficient to get you started in DSP
programming. Any corrections/suggestions are welcome.

   Raymond Li
   UVic EE Comm
   r.li@ieee.ca


Section A: Description of DSP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Registers: Used to Hold Data Values
===================================

Registers are groups of flip-flops used for memory storage. Here is a list
of the registers available on the ADSP-2181 (detailed explanations follow):

ADSP-Register           Register Name(s)
----------------------------------------------------------------------------
ax0, ax1, ay0, ay1      ALU Inputs
ar                      ALU Result
af                      ALU Feedback
mx0, mx1, my0, my1      Multiplier Inputs
mr0, mr1, mr2           Multiplier Results (3 parts)
mf                      Multiplier Feedback
si                      Shifter Input
se                      Shifter Exponent
sr0, sr1                Shifter Result (2 parts)
sb                      Shifter Block (for block floating-point format)
px                      PMD-DMD Bus Exchange
i0-i7                   DAG Index Registers
m0-m7                   DAG Modify Registers
l0-l7                   DAG Length Registers (for circular buffers)
pc                      Program Counter
cntr                    Counter for Loops
astat                   Arithmetic Status
mstat                   Mode Status
sstat                   Stack Status
imask                   Interrupt Mask
icntl                   Interrupt Control Modes
rx0, rx1                Receive Data Registers (not on ADSP-2100)
tx0, tx1                Transmit Data Registers (not on ADSP-2100)

The ADSP-2181 has quite a few registers, which is typical for RISC-type CPUs.
This is a common tradeoff for the increased computational speed. You'll find
that a large number of registers is needed since instructions are
restricted in the operands (registers) they can use.


Register Wordlength [complete this]
-------------------

Along with the types of registers available, you'll need to know their
lengths:

16 bits: mx0, mx1, my0, my1, mr0, mr1, ax0, ax1, ay0, ay1, ar, sr0, sr1, si
40 bits: mr (consists of mr2, mr1, mr0)
32 bits: sr (consists of sr1, sr0)
 8 bits: mr2, se
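
To make the mr composition concrete, here is a small host-side sketch in
Python (not ADSP-2181 code) that splits a signed 40-bit value into its mr2,
mr1, and mr0 fields and joins them again. The function names split_mr and
join_mr are made up for this illustration.

```python
def split_mr(value):
    """Split a signed 40-bit value into (mr2, mr1, mr0) bit fields."""
    raw = value & 0xFFFFFFFFFF      # wrap to 40 bits, two's complement
    mr0 = raw & 0xFFFF              # 16 least significant bits
    mr1 = (raw >> 16) & 0xFFFF      # next 16 bits
    mr2 = (raw >> 32) & 0xFF        # top 8 bits
    return mr2, mr1, mr0


def join_mr(mr2, mr1, mr0):
    """Rebuild the signed 40-bit value from its three fields."""
    raw = (mr2 << 32) | (mr1 << 16) | mr0
    return raw - (1 << 40) if raw & (1 << 39) else raw


print([hex(field) for field in split_mr(0x12345678)])
print([hex(field) for field in split_mr(-1)])
```

For a value that fits in 32 bits, mr2 is just the sign extension of mr1
(0x00 or 0xFF).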


Reserved Registers
------------------

The following is very important. Since the ADSP-2181 is interfaced to the
AD1847 codec, some resources are consumed. These include the 2 serial ports
and these registers:

   i0, l0, i1, l1

This is because these registers are used during the initialization code
for the EZ-KIT Lite:

   i0 = ^rx_buf;              {point to start of buffer}
   l0 = %rx_buf;              {initialize length register}
   i1 = ^tx_buf;
   l1 = %tx_buf;
   i3 = ^init_cmds;
   l3 = %init_cmds;

These lines are in the example code listing in the Reference Manual page
5-7. Notice that i3 and l3 are available after the initialization, while
the others are used (needed) throughout the program. (Note the
algebraic-like syntax, semicolon delimiters, and braces... more on syntax
later.)


Computational Units: Gives Computational Functionality
======================================================

There are 3 computational units in the ADSP-2181. These 3 units form the
basis of the instruction set and allow you to perform a variety of
arithmetic, logical, and bit manipulation functions. Note that most
instructions operate with two operands, xop and yop, and that there are
unique restrictions on what these operands can be. Check what the
permissible xops and yops are (i.e. which registers) in the Reference
Manual.

The breakdown of the functions for each unit follows. Do become familiar
with the instruction set summary in the Appendix of the Reference
Manual as you will very often need to refer to it (pgs 10-17).

ALU (Arithmetic Logic Unit)
   Add / Add with carry
   Subtract X-Y / Subtract X-Y with borrow
   Subtract Y-X / Subtract Y-X with borrow
   AND, OR, XOR
   PASS, CLEAR
   Negate
   NOT
   Absolute Value
   Increment
   Decrement
   Divide
   Bit Operations

MAC
   Multiply
   Multiply / Accumulate
   Multiply / Subtract
   Transfer MR
   Clear
   Conditional MR Saturation

Shifter
   Arithmetic Shift
   Logical Shift
   Normalize
   Derive Exponent
   Block Exponent Adjust
   Arithmetic Shift Immediate
   Logical Shift Immediate


Program Control: Loops, Branching, and Subroutines
==================================================

The other aspect of algorithm design is the ability to control program
flow. The ADSP-2181 control flow instructions include the following (pg 20):

   Do Until
   Jump
   Call Subroutine
   Jump/Call on Flag In Pin
   Modify Flag Out Pin
   Return from Subroutine
   Return from Interrupt Service Routine
   Idle


Data Transfer: Program and Data Management
==========================================

One of the distinctions between high-level and assembly language programming
is the level of detail involved in data transfers. Instead of data types, you
now deal with addressing modes, allocation, and transfers. Here are the kinds
of data transfers you can do with the ADSP-2181 (pg 18):

   Register-to-Register Move
   Load Register Immediate
   Read Overlay Register
   Write Overlay Register
   Data Memory Read (Direct Address)
   I/O Read (Direct Address)
   Data Memory Read (Indirect Address)
   Program Memory Read (Indirect Address)
   Data Memory Write (Direct Address)
   Writes Contents of Overlay Registers to Data Memory
   I/O Write (Direct Address)
   Data Memory Write (Indirect Address)
   Program Memory Write (Indirect Address)


Multifunction Instructions
==========================

This will be new to people without previous DSP programming experience.
DSPs are distinguished from general-purpose CPUs by their speed and their
ability to execute MAC (multiply and accumulate) instructions. As it
turns out, this instruction is used very often in DSP algorithms. To
increase processing speed further, the ADSP-2181 has multifunction
instructions. Since almost all instructions execute in 1 clock cycle
(30ns), executing multiple functions in one instruction greatly increases
the computation speed of your DSP algorithm. These instructions include
(pg 19):

   Computation with Register-to-Register Move
   Computation with Memory Read
   Computation with Memory Write
   Data & Program Memory Read
   ALU/MAC with Data & Program Memory Read


Miscellaneous Instructions
==========================

For completeness, here are the miscellaneous instructions (pg 21):

  NOP
  Modify Address Register
  Stack Control
  Mode Control
  Put Processor in Idle State
  Put Processor in Idle State and Slow Clock by a Factor of n


Numeric Formats
===============

The ADSP-2181 is a 16-bit fixed point DSP microprocessor which means it
has 16-bits of precision to represent numeric values.

There are two general classes of DSPs: fixed-point or integer DSPs and
floating-point DSPs. Fixed point/integer DSPs are often cheaper, faster,
and consume less power whereas floating point DSPs are simpler to program.

Floating-point DSPs differ from fixed-point/integer DSPs in that they
use an exponent to indicate the radix point (decimal point) of the
numeric value.

You can see the benefits of floating-point representation from using
exponential or scientific notation - you get more dynamic range by not
being limited by the position of the radix point. 

For example: if you only had 4 decimal digits, the smallest positive
non-zero number you can represent is 0.0001 in fixed point fractional or
1 in integer format. In floating point format with a 3 digit exponent,
it would be 1E-999. Floating point allows you to omit the
non-significant digits to increase the precision of the numeric value.

Since numbers are represented in bits and not in digits in the DSP, each
bit is weighted according to its position from the radix point. 

For example, 101.1010 in binary is 5.625 in decimal:

   101.1010b = 1*2^2 + 0*2^1 + 1*2^0 + 1*2^-1 + 0*2^-2 + 1*2^-3 + 0*2^-4
             = 4     + 0     + 1     + 0.5    + 0      + 0.125  + 0
             = 5.625d
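
The same positional weighting can be checked on a PC. The following is a
minimal Python sketch (the helper name binary_to_decimal is made up for
this illustration) that sums the bit weights on either side of the radix
point for an unsigned binary string:

```python
def binary_to_decimal(bits):
    """Convert an unsigned binary string such as '101.1010' to a float."""
    integer_part, _, fraction_part = bits.partition(".")
    value = 0.0
    # Integer bits carry weights 2^0, 2^1, ... counting left from the radix point.
    for position, bit in enumerate(reversed(integer_part)):
        value += int(bit) * 2.0 ** position
    # Fractional bits carry weights 2^-1, 2^-2, ... counting right from it.
    for position, bit in enumerate(fraction_part, start=1):
        value += int(bit) * 2.0 ** -position
    return value


print(binary_to_decimal("101.1010"))  # 5.625
```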

With the 16-bit ADSP-2181, you can have 16 different fixed-point formats by
varying the position of the radix point:

  1.15, 2.14, 3.13, 4.12, 5.11, 6.10, 7.9, 8.8, 9.7, 10.6, 11.5, 12.4,
  13.3, 14.2, 15.1, and 16.0.

The above example is in 3.4 format where 3 is the number of integer bits and
4 is the number of fractional bits.

You can work out a table showing the largest positive value, the largest
negative value, and the LSB value for each of the 16 formats.

For example:

Format    Largest Positive    Largest Negative       LSB Value

1.15       0.999969482421875         -1          0.000030517578125
16.0   32767.000000000000000     -32768          1.000000000000000

In general, the range of an a.b fractional format number is:

   -2^(a-1) <= x <= 2^(a-1) - 2^(-b)

where a is the number of integer bits
      b is the number of fractional bits
      x is the numeric value
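
If you would rather generate the full table than work it out by hand, the
range formula translates directly into a few lines of Python. This is only
a host-side sketch; the function name format_range is made up for this
illustration.

```python
def format_range(a, b):
    """Return (largest positive, largest negative, LSB) for a signed a.b format."""
    lsb = 2.0 ** -b
    largest_negative = -(2.0 ** (a - 1))
    largest_positive = 2.0 ** (a - 1) - lsb
    return largest_positive, largest_negative, lsb


# Print one row for each of the 16 formats of a 16-bit word.
for a in range(1, 17):
    b = 16 - a
    positive, negative, lsb = format_range(a, b)
    print(f"{a:2d}.{b:<2d}  {positive:22.15f}  {negative:8.0f}  {lsb:.15f}")
```

The 1.15 and 16.0 rows should agree with the two rows of the table above.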

In binary multiplication, the product of two 16-bit numbers is a 32-bit
number. More specifically, M.N multiplied with P.Q gives a (M+P).(N+Q)
format number. For example, the product of two 13.3 numbers is a 26.6
number and the product of two 1.15 numbers is a 2.30 number.

The product of two twos-complement numbers has 2 sign bits, which are
identical, so one of them is redundant. (Remember twos-complement
representation? One's complement is bit inversion; twos-complement is bit
inversion followed by adding 1.) Since one bit is redundant, you can left
shift the product by one bit.

Additionally, if one of the inputs was a 1.15 number, the left shift
causes the result to have the same format as the other input (with 16 bits
of additional precision). For example, multiplying a 1.15 number by a 5.11
number yields a 6.26 number. When shifted left one bit, the result is a 5.27
number, or a 5.11 number plus 16 LSBs. [REF]
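
The format bookkeeping above is easy to check with plain integers. The
sketch below is Python on a PC, not DSP code, and the operand values 0.5
and -3.25 are arbitrary choices for illustration. It multiplies a 1.15
number by a 5.11 number, shifts the 6.26 product left one bit to get 5.27,
and keeps the top 16 bits as a 5.11 result.

```python
def to_fixed(x, frac_bits):
    """Quantize x to an integer with the given number of fractional bits."""
    return int(round(x * (1 << frac_bits)))


a_raw = to_fixed(0.5, 15)       # 0.5 in 1.15 format
b_raw = to_fixed(-3.25, 11)     # -3.25 in 5.11 format

product = a_raw * b_raw         # 6.26 format: 26 fractional bits
shifted = product << 1          # drop the redundant sign bit: 5.27 format
top16 = shifted >> 16           # keep the 16 MSBs: 5.11 format

print(product / 2 ** 26)        # -1.625
print(shifted / 2 ** 27)        # -1.625
print(top16 / 2 ** 11)          # -1.625
```

Keeping only the top 16 bits here truncates the 16 extra LSBs rather than
rounding them.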

The ADSP-2181 has two modes: fractional and integer. In fractional mode,
which is the default on reset, the multiplier result is always shifted
left one bit before being written to the result register. A left shift
causes the multiplier result to be 1.31 which can be rounded to 1.15. As
a result, the 1.15 format is the most convenient to use.

In integer mode, the left shift does not occur. The choice of mode is
controlled by a bit in the MSTAT register.

1.15 numbers are conveniently represented in hex notation. For example:

   1.15 Number     Decimal Equivalent

    0x0001          0.000030517578125
    0x7FFF          0.999969482421875
    0xFFFF         -0.000030517578125
    0x8000         -1.000000000000000
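
When preparing coefficients or checking values dumped from the kit, it
helps to convert between 1.15 hex words and decimal on a PC. Here is a
minimal Python sketch; the function names q15_to_float and float_to_q15
are made up for this illustration, and values outside the 1.15 range are
simply clipped.

```python
def q15_to_float(word):
    """Interpret a 16-bit word (0x0000..0xFFFF) as a signed 1.15 number."""
    if word & 0x8000:            # sign bit set: negative in two's complement
        word -= 0x10000
    return word / 32768.0


def float_to_q15(x):
    """Quantize x to 1.15 and return the 16-bit word, clipping out-of-range input."""
    raw = int(round(x * 32768.0))
    raw = max(-32768, min(32767, raw))
    return raw & 0xFFFF


for word in (0x0001, 0x7FFF, 0xFFFF, 0x8000):
    print(f"0x{word:04X}  {q15_to_float(word):.15f}")
print(f"0x{float_to_q15(0.75):04X}")  # 0x6000
```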


Cycle Times
===========

The ADSP-2181 normally executes all instructions in one cycle (30ns).
If an instruction causes a data fetch from program memory, an extra cycle
is required since the processor cannot pre-fetch the next instruction in
the same cycle. This overhead cycle usually occurs inside loops and only
once.


Available Memory
================

The ADSP-2181 has 80K bytes of on-chip memory. 16K words (24 bits wide) are
allocated as program RAM and 16K words (16 bits wide) are allocated for
data memory as shown:

   Total Memory:     80K bytes = 80 * 1024 * 8 = 655,360 bits

   Program Memory:   16K * 24 = 16 * 1024 * 24 = 393,216 bits
   Data Memory:      16K * 16 = 16 * 1024 * 16 = 262,144 bits

                     393,216 + 262,144 = 655,360 bits = 80K bytes

This is a significant amount of RAM compared to other DSPs from TI and
Motorola.

The memory maps are on page 30. However, they don't show the memory taken
up by the Monitor Program. The monitor program essentially is the OS for
the EZ-KIT Lite and is loaded from EPROM to its RAM locations on reset.
Note the memory restrictions on pg 7-1 and the revised memory map below:


Data Memory Map with Monitor (Words are 16 bits wide)
----------------------------------------------------

        Data Memory        Address
   +---------------------+--------+
   | 32 Memory Mapped    | 0x3FFF |
   |    Registers        | 0x3FE0 |
   +---------------------+--------+
   |    480 Monitor      | 0x3FDF |
   | Operating Variables | 0x3E00 |
   +---------------------+--------+
   |   7680 Available    | 0x3DFF |
   |   Internal Words    | 0x2000 |
   +---------------------+--------+
   |                     | 0x1FFF |
   |                     |        |
   |                     |        |
   |    8K Available     |        |
   |   Internal Words    |        |
   |                     |        |
   |                     |        |
   |                     |        |
   |                     | 0x0000 |
   +---------------------+--------+

As you can see, this leaves 15,872 data words available.


Program Memory Map with Monitor (Words are 24 bits wide)
--------------------------------------------------------

       Program Memory      Address
   +---------------------+--------+
   |    2048 Monitor     | 0x3FFF |
   |    Program Words    | 0x3800 |
   +---------------------+--------+
   |                     | 0x37FF |
   |                     |        |
   |   6144 Available    |        |
   |   Internal Words    |        |
   |                     |        |
   |                     | 0x2000 |
   +---------------------+--------+
   |                     | 0x1FFF |
   |                     |        |
   |                     |        |
   |    8K Available     |        |
   |   Internal Words    |        |
   |                     |        |
   |                     |        |
   |                     |        |
   |                     | 0x0000 |
   +---------------------+--------+

The Monitor takes up 2K words of memory leaving 14K or 14,336 words available.
This is plenty of memory for most DSP algorithms, but you will need to keep
the restrictions in mind when allocating large data buffers.
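
As a quick check of the word counts, the address ranges from the two maps
can be tabulated and summed. This is a small Python sketch; the table
layout and the name available_words are made up for this illustration,
while the addresses are copied from the maps above.

```python
# (description, first address, last address, available to user programs)
DATA_MEMORY = [
    ("memory-mapped registers", 0x3FE0, 0x3FFF, False),
    ("monitor variables",       0x3E00, 0x3FDF, False),
    ("available (upper)",       0x2000, 0x3DFF, True),
    ("available (lower)",       0x0000, 0x1FFF, True),
]
PROGRAM_MEMORY = [
    ("monitor program",         0x3800, 0x3FFF, False),
    ("available (upper)",       0x2000, 0x37FF, True),
    ("available (lower)",       0x0000, 0x1FFF, True),
]


def available_words(memory_map):
    """Count the words in every region marked as available."""
    return sum(last - first + 1
               for _, first, last, free in memory_map if free)


print(available_words(DATA_MEMORY))     # 15872
print(available_words(PROGRAM_MEMORY))  # 14336
```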


Shell Program
=============

The ADSP-2181 and AD1847 interface requires a fair amount of setup as shown
in the example listing starting on pg 5-5. I've merged several versions of
this "template" program into a listing called "ez_shell.dsp" which comes
with this tutorial. Do read the additional comments to gain a better
understanding of how it works and the syntax of the instructions.


Batch Files
===========

The development software executes under DOS with command line options. To
simplify the process, 3 batch files were written to speed up the iterative
process:

   eza.bat      : quick assembly and link
   ezl.bat      : quick upload to kit
   ezs.bat      : quick simulation

Take a look at the invocation of the programs in the batch files and modify
them to suit your needs.


Other Resources
===============

Further to the resources listed in the Introduction, here are some additional
resources if you are especially keen on DSPing:

[complete]
DSP FAQ
TI & Motorola web sites
DSPNet
newsgroup


Code & S/W Update
=================

An update for the EZ-KIT Lite software is available on Analog Devices' web
site. You can find it under [directions]


Instruction Set Syntax
======================
   o variable declaration (and restrictions)
   o data transfer & assignment
   o arithmetic
   o flow control statements
   o loops


{=========================================================================}
{=========================================================================}

Instruction Set Overview
------------------------

MAC Registers
-------------
mx0, mx1, my0, my1, mr, mr0, mr1, mr2

Where:
   mr0 16 LSB bits
   mr1 16 MSB bits
   mr2 overflow bits


MAC Instructions
----------------
mr = 0;                       {clear mr}
mr = xop * yop (ss);          {(ss): signed x signed operands, e.g. 1.15 numbers}
mr = xop * yop (rnd);         {(rnd): round 32-bit result into 16 MSBs in mr1}

mr = mr + xop * yop (ss);
mr = mr + xop * yop (rnd);

mr = mr - xop * yop (ss);
mr = mr - xop * yop (rnd);

if mv sat mr;                 {saturate mr to its largest (positive or negative)
                               value whenever the overflow flag mv is raised}

Where:
   xop: mx0, mx1, mr0, mr1, mr2, ar, sr0, sr1
   yop: my0, my1
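
To see what the accumulate and saturate steps do numerically, here is a
rough Python model of the fractional-mode MAC. It is a sketch under stated
assumptions, not a bit-exact description of the hardware: mr is held as an
ordinary Python integer, the overflow flag is taken to mean "mr no longer
fits in 32 bits", and the names mac_ss, mv_flag, and sat are made up for
this illustration.

```python
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000


def signed16(word):
    """Interpret a 16-bit word as a signed two's-complement integer."""
    return word - 0x10000 if word & 0x8000 else word


def mac_ss(mr, xop, yop):
    """Model mr = mr + xop * yop (ss) in fractional mode (product shifted left 1)."""
    return mr + ((signed16(xop) * signed16(yop)) << 1)


def mv_flag(mr):
    """Assumed overflow test: True when mr needs more than 32 bits."""
    return mr > INT32_MAX or mr < INT32_MIN


def sat(mr):
    """Model 'sat mr': clamp to the largest positive or negative 32-bit value."""
    return max(INT32_MIN, min(INT32_MAX, mr))


mr = 0
mr = mac_ss(mr, 0x7FFF, 0x7FFF)   # nearly 1.0 * 1.0
mr = mac_ss(mr, 0x7FFF, 0x7FFF)   # sum is now close to 2.0: overflows 1.31
if mv_flag(mr):
    mr = sat(mr)
print(hex(mr))                    # 0x7fffffff
```

The extra bits in mr2 are what let the running sum exceed the 1.31 range
temporarily, so you only need to saturate once, at the end of the
accumulation.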


ALU Registers
-------------
ax0, ax1, ay0, ay1, ar

ALU Instructions
----------------
ar = xop + yop;
ar = xop - yop;
ar = yop - xop;

Where:
   xop: ax0, ax1, ar, mr0, mr1, mr2, sr0, sr1
   yop: ay0, ay1


Shifter Registers
-----------------
sr, sr0, sr1, si, se

Shifter Instructions
--------------------
sr = ashift xop by exp (hi);  {scale xop by 2^exp into sr;
                               sr1 contains the 16 MSBs}
sr = ashift xop (hi);         {exp preloaded into se;
                               se is the 8-bit exponent register}

Where:
   xop: si, sr0, sr1, ar, mr0, mr1, mr2
   exp: any signed integer, such as 1, -1, 2, -2, ...
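
The effect of the (hi) arithmetic shift can be sketched in Python as
follows. This is an approximate host-side model, assuming the 16-bit input
is placed in the upper half of the 32-bit sr, sign-extended on right
shifts, and that bits shifted out of the 32-bit field are simply lost. The
function name ashift_hi is made up for this illustration; check the
User's Manual for the exact hardware behaviour.

```python
def ashift_hi(xop, exp):
    """Model sr = ashift xop by exp (hi); returns the words (sr1, sr0)."""
    if xop & 0x8000:                 # sign-extend the 16-bit input
        xop -= 0x10000
    value = xop << 16                # (hi): input sits in the upper half of sr
    if exp >= 0:
        value <<= exp                # left shift: scale up by 2^exp
    else:
        value >>= -exp               # Python's >> is arithmetic on negative ints
    raw = value & 0xFFFFFFFF         # keep 32 bits; overflow bits are dropped
    return raw >> 16, raw & 0xFFFF


print([hex(w) for w in ashift_hi(0x1000, 2)])    # ['0x4000', '0x0']
print([hex(w) for w in ashift_hi(0x8000, -1)])   # ['0xc000', '0x0']
print([hex(w) for w in ashift_hi(0x0001, -1)])   # ['0x0', '0x8000']
```

Note that a left shift in this model does not saturate, so scaling a
result up by 4 can overflow if the value is not small enough beforehand.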


DAG (Data Address Generator) Registers
--------------------------------------
DAG1 only points to DM memory
   Index Registers: i0, i1, i2, i3 {14 bit registers}
   Modify Registers: m0, m1, m2, m3
   Length Registers: l0, l1, l2, l3
   
DAG2 points to DM or PM memory
   Index Registers: i4, i5, i6, i7
   Modify Registers: m4, m5, m6, m7
   Length Registers: l4, l5, l6, l7

mr1 = dm(i2, m2); {write into mr1 the contents of the DM memory location
                  pointed to by i2 and then change i2 by an amount m2}

dm(i2, m2) = mr1; {write into DM memory location pointed by i2 the
                  contents of mr1 and then change i2 by an amount m2}

modify(i2,m2);    {modify i2 by amount m2 without data access}

{=========================================================================}
{=========================================================================}

Example: Constant and Variable Declaration
------------------------------------------
.const a = 0x6000;   {a = 0.75 in decimal format}
.const D = 3;        {D = 3, an integer; can't use this in a register!}
.var/dm w[D+1];      {4-dimensional linear buffer w[i], i=0,1,2,3}
.var/dm x, y;        {temporary variables}

Example: Data Transfer
----------------------
mr1 = 0;             {load mr1 with zero}
my1 = a;             {load my1 with the constant a}
ax1 = 0x4000;        {load ax1 with the value 0x4000 = 0.50 in decimal}
ar = sr1;            {load ar with content of sr1}
mx1 = ay0;           {load mx1 with content of ay0}
mr1 = dm(x);         {load mr1 with content of DM location x}
dm(y) = my1;         {load DM location y with the content of my1}
mr1 = dm(w);         {load mr1 with content of buffer location w[0]}
                     {note syntax for buffer}
mr1 = dm(w+1);       {load mr1 with content of buffer location w[1]}
dm(w+2) = mr1;       {load buffer location w[2] with content of mr1}


Example: Linear Buffer
----------------------
.const D = 100;
.var/dm w[D+1];      {placed in DM memory}

i2 = ^w;             {i2 points to beginning of w}
l2 = 0;              {l2 must be set to 0 for a linear buffer; does not autowrap}


Example: Circular Delay-Line Buffer
-----------------------------------
.const D = 100;
.var/dm/circ w[D+1];    {circular buffer length 101 placed in DM memory}

i2 = ^w;       {i2 points to beginning of w; note ^ operator}
l2 = %w;       {l2 is set equal to the length of w; note % operator}

m2 = 1;        {post increment i2 by one}
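
The way i2, l2, and m2 cooperate can be modelled in a few lines of Python.
This is only a sketch of the wrap-around idea; the class name
CircularPointer and the base address 0x0100 are made up for this
illustration, and hardware restrictions on buffer placement are not
modelled.

```python
class CircularPointer:
    """Toy model of an index register with a length register (circular mode)."""

    def __init__(self, base, length):
        self.base = base        # start address of the buffer (like ^w)
        self.length = length    # buffer length (like %w)
        self.index = base       # current address held in the index register

    def post_modify(self, m):
        """Return the current address, then advance by m with wrap-around."""
        current = self.index
        offset = (self.index - self.base + m) % self.length
        self.index = self.base + offset
        return current


i2 = CircularPointer(base=0x0100, length=101)
addresses = [i2.post_modify(1) for _ in range(101)]
print(hex(addresses[0]), hex(addresses[-1]), hex(i2.index))
```

After 101 accesses with an increment of 1, the pointer is back at the
start of the buffer, which is what makes the delay-line macros later in
this tutorial work without any explicit bounds checks.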


Example: Concatenated Circular Buffers
--------------------------------------
{This declaration defines an extended circular buffer of double-length
2(M+1). The DAG pointer i4 will traverse both buffers a and b before
wrapping around to the beginning.}

.const M = 100;
.var/pm/circ a[M+1], b[M+1];

i4 = ^a; l4 = 2*(M+1);


Example: Do Loop
----------------
      cntr = l2;
      do zero until ce;          {repeat until counter expires}
zero:    dm(i2, m2) = 0;         {put 0 in dm(i2, m2) and point to next}


Examples: Getting a Circular Buffer Value
-----------------------------------------
      m2 = d; modify(i2, m2);    {go to location pointed to by i2 + d}
      m2 =-d; mr1 = dm(i2, m2);  {put its content in mr1, then restore i2}


Example: First Order Filter
---------------------------
Theory:
   y(n)=ay(n-1)+bx(n), where a=0.75, b=0.25
   output=fn(past output, present input, coefficients)

Assign internal states:
   w1(n) = y(n-1)
   w1(n+1) = y(n)


.const a = 0x6000;         {a=0.75}
.const b = 0x2000;         {b=0.25}
.var/dm w1;                {filter's internal state}

ax0 = 0;                   {ax0 is used to hold the constant 0 because there is
                            no instruction to write an immediate data value to
                            memory using an immediate address}
dm(w1) = ax0;              {initialize w1 to zero}
                     
my0 = dm(rx_buf + 2);      {get right input from codec}
                           {the x value}

mx0 = b;                   {filter coefficient b}
mr = mx0 * my0 (ss);       {mr=b*x}
mx0 = a;                   {filter coefficient a}
my0 = dm(w1);              {get internal state from DM}
mr = mr + mx0 * my0 (rnd); {mr = y = a*w1+b*x = output sample}
                           {rounded to 16 MSB}
dm(w1) = mr1;              {update state, w1=y}

dm(tx_buf + 2) = mr1;      {send right output to codec}
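
Before running the filter on the kit, it is worth having a floating-point
reference to compare against. The following Python sketch implements the
same difference equation with the same internal state w1; the function
name first_order_filter is made up for this illustration, and no
fixed-point rounding is modelled.

```python
def first_order_filter(samples, a=0.75, b=0.25):
    """Reference model of y(n) = a*y(n-1) + b*x(n) with state w1 = y(n-1)."""
    w1 = 0.0                 # internal state, initialized to zero
    outputs = []
    for x in samples:
        y = a * w1 + b * x   # output sample
        w1 = y               # update state for the next sample
        outputs.append(y)
    return outputs


# Step response: the output climbs toward b / (1 - a) = 1.0
print(first_order_filter([1.0, 1.0, 1.0, 1.0]))
# [0.25, 0.4375, 0.578125, 0.68359375]
```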


Example: 3rd Order FIR Filter
-----------------------------
Theory:
   y(n)=2x(n)-3x(n-1)-2x(n-2)+x(n-3)
   {dot product of input & coefficient vectors}
   
   output=fn(past inputs, present input, coefficients)

Algorithm:
   for each input x do:
      *p = s0 = x
      s1 = tap(^w,1,1,p)
      s2 = tap(^w,1,2,p)
      s3 = tap(^w,1,3,p)
      y  = 2*s0-3*s1-2*s2+s3
      cdelay()

Coefficients in [-4,4], scaled down by 4 to 1.15 format:
   h=[2,-3,-2,1] -> [0.50, -0.75, -0.50, 0.25]
                 -> [0x4000, 0xa000, 0xc000, 0x2000]
                 
The final sum will then need to be scaled up by 4.
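
The scaling and hex conversion can be reproduced on a PC as a check. This
is a minimal Python sketch; the helper names power_of_two_scale and
to_q15_word are made up for this illustration, and the scale is chosen as
the smallest power of two strictly larger than every coefficient
magnitude.

```python
def power_of_two_scale(coeffs):
    """Smallest power of two strictly greater than every |coefficient|."""
    scale = 1
    while max(abs(c) for c in coeffs) >= scale:
        scale *= 2
    return scale


def to_q15_word(x):
    """Quantize x to 1.15 and return it as a 16-bit word, clipping if needed."""
    raw = int(round(x * 32768))
    raw = max(-32768, min(32767, raw))
    return raw & 0xFFFF


h = [2, -3, -2, 1]
scale = power_of_two_scale(h)
words = [to_q15_word(c / scale) for c in h]
print(scale, [f"0x{w:04x}" for w in words])
# 4 ['0x4000', '0xa000', '0xc000', '0x2000']
```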

Code:

.const M = 3;              {filter order}
.var/dm/circ w[M+1];       {delay-line buffer placed in DM}
.var/pm/circ h[M+1];       {filter coefficient buffer placed in PM}

.init h: 0x4000, 0xa000, 0xc000, 0x2000; {can be entered as 3.13}

i2 = ^w; l2 = %w;          {delay-line buffer pointer and length}
i4 = ^h; l4 = %h;          {coefficient buffer pointer and length}

zero (i2, m2, l2);         {clear delay-line buffer to zero}

mx1 = dm(rx_buf +2);       {read right input from codec}

tapin(i2, m2, mx1);        {put mx1 into tap-0 of delay line}

{---Dot Product of internal states with filter coefficients---}
m2 = 1; m4 = 1;            {set increments to 1}

mr = 0, mx0 = dm(i2,m2), my0 = pm(i4,m4);
                           {example of multifunction instructions}
                           {clear, fetch, increment}
                           {executes in one cycle = 30ns}
                           {fetch s0,h0, point to s1,h1}
                           
mr = mr + mx0 * my0 (ss), mx0 = dm(i2,m2), my0 = pm(i4,m4);
                           {1st partial sum}
                           {psum = s0*h0}
                           {fetch s1,h1, point to s2,h2}
                           
mr = mr + mx0 * my0 (ss), mx0 = dm(i2,m2), my0 = pm(i4,m4);
                           {2nd partial sum}
                           {psum = psum + s1*h1}
                           {fetch s2,h2, point to s3,h3}
                           
mr = mr + mx0 * my0 (ss), mx0 = dm(i2,m2), my0 = pm(i4,m4);
                           {3rd partial sum}
                           {psum = psum + s2*h2}
                           {fetch s3,h3, point to s0,h0}
                           {wrap around to s0,h0}
                           
mr = mr + mx0 * my0 (rnd); {mr = y; final sum}
                           {sum = psum + s3*h3}
                           {sum in mr1}

if mv sat mr;              {check for saturation}

cdelay(i2, m2);            {update delay}

sr = ashift mr1 by 2 (hi); {scale output by factor of 2^2 = 4}

dm(tx_buf + 2) = sr1;      {write right output to codec}

   {---Using Do-Loop for Multifunction Instructions---}
            m2 = 1; m4 = 1;
            mr = 0, mx0 = dm(i2,m2), my0 = pm(i4,m4);
            cntr = M;      {M = filter order}
            do dotloop until ce;
   dotloop:    mr = mr + mx0 * my0 (ss), mx0 = dm(i2,m2), my0 = pm(i4,m4);
            mr = mr + mx0 * my0 (rnd);
            if mv sat mr;

   {---Replace Multifunction Instructions with DOT.DSP Macro---}
   mx1 = dm(rx_buf + 2);      {read right input from codec}
   tapin(i2, m2, mx1);        {put mx1 into tap-0 of delay line}
   dot(M, i4, m4, i2, m2);    {compute output into mr1}
   cdelay(i2, m2);            {update delay}
   sr = ashift mr1 by 2 (hi); {scale output by factor of 2^2 = 4}
   dm(tx_buf +2) = sr1;       {write right output to codec}

   {---Replace Instructions with CFIR.DSP Macro---}
   mx1 = dm(rx_buf + 2);      {read right input from codec}
   cfir(M, i4, m4, i2, m2, mx1); {input from mx1, output in mr1}
   sr = ashift mr1 by 2 (hi); {scale output by factor of 2^2 = 4}
   dm(tx_buf +2) = sr1;       {write right output to codec}
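
For debugging, it helps to have a floating-point reference that follows
the same three steps as the assembly: put the input into tap-0, form the
dot product while the pointer walks around the circular buffer, then
backshift the pointer. The Python sketch below does this with a list and
an integer index. The function name cfir mirrors the macro, but this is a
host-side model, not a translation of the macro files, and it uses the
unscaled coefficients so no final scale-up by 4 is needed.

```python
def cfir(h, w, p, x):
    """One FIR output sample using a circular delay line.

    h: coefficients (length M+1), w: delay line (length M+1, updated in place),
    p: index of tap-0 in w, x: new input sample. Returns (y, new p).
    """
    length = len(w)
    w[p] = x                                # tapin: store input at tap-0
    y = 0.0
    for k, coeff in enumerate(h):           # dot: pointer advances and wraps
        y += coeff * w[(p + k) % length]
    p = (p - 1) % length                    # cdelay: backshift the pointer
    return y, p


h = [2.0, -3.0, -2.0, 1.0]                  # y(n)=2x(n)-3x(n-1)-2x(n-2)+x(n-3)
w = [0.0] * len(h)                          # delay line cleared to zero
p = 0
ys = []
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    y, p = cfir(h, w, p, x)
    ys.append(y)
print(ys)                                   # [2.0, 1.0, -2.0, -4.0, -6.0]
```

Comparing these values with the DSP output (after scaling) is a quick way
to catch pointer or coefficient-order mistakes.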

{=========================================================================}
{=========================================================================}

{zero.dsp - initialize delay line buffer to zero.
 Junior DSP Lab - Rutgers ECE Dept - S. J. Orfanidis - Jan 1996.

       %0 = pointer to delay-line buffer, e.g.,  I2
       %1 = M-register to use with buffer, e.g., M2
       %2 = length of buffer, e.g., L2

       typical usage:
       --------------
       zero(i2, m2, L2);           i2 cycles back to its initial value

       internal operation:
       -------------------
       cntr = L2; m2 = 1; 
              do loop until ce;
       loop:         dm(i2, m2) = 0;
}

.macro zero(%0, %1, %2);
.local loop;

cntr = %2;  %1 = 1;
       do loop until ce;
loop:     dm(%0, %1) = 0;

.endmacro;

{=========================================================================}
{=========================================================================}

{tap.dsp - tap outputs of circular delay line.
 Junior DSP Lab - Rutgers ECE Dept - S. J. Orfanidis - Jan 1996.

 Based on tap.c and tap2.c of Introduction to Signal Processing.

       %0 = pointer to delay-line buffer,  e.g., i2
       %1 = M-register to use with buffer, e.g., m2
       %2 = d, for d-th tap content, where d=1, ... ,D
       %3 = data register for result, e.g., 
            ax0, ax1, ay0, ay1, ar, mx0, mx1, my0, my1, mr1, sr1

       typical usage:
       --------------
       tap(i2, m2, d, sr1);               put d-th tap content into SR1
                                          note: i2 is not changed
       internal operation:
       -------------------
       m2 = d;   modify(i2, m2);          point to d-th tap
       m2 =-d;   sr1 = dm(i2, m2);        put d-th tap in data register and
                                          restore i2 to its entry value
}

.macro tap(%0, %1, %2, %3);

%1 =  %2;  modify(%0, %1);                {point to d-th tap}
%1 = -%2;  %3 = dm(%0, %1);               {put d-th tap in data register}

.endmacro;

{=========================================================================}
{=========================================================================}

{tapin.dsp - put input sample into tap-0 of delay line.
 Junior DSP Lab - Rutgers ECE Dept - S. J. Orfanidis - Jan 1996.

       %0 = pointer to delay-line buffer, e.g.,  I2
       %1 = M-register to use with buffer, e.g., M2
       %2 = data register holding input, e.g.,
            ax0, ax1, ay0, ay1, ar, mx0, mx1, my0, my1, mr1, sr0, sr1

       typical usage:
       --------------
       tapin(i2, m2, mx1);                put value from MX1 into 0-th tap
                                          note: i2 is not changed
       internal operation:
       -------------------
       m2 = 0;   dm(i2, m2) = mx1;
}

.macro tapin(%0, %1, %2);

%1 = 0;  dm(%0, %1) = %2;          {put value from dreg %2 into delay line}

.endmacro;

{=========================================================================}
{=========================================================================}

{cdelay.dsp - update circular delay-line buffer.
 Junior DSP Lab - Rutgers ECE Dept - S. J. Orfanidis - Jan 1996.

 Based on cdelay.c and cdelay2.c of Introduction to Signal Processing.

       %0 = pointer to delay-line buffer,  e.g., i2
       %1 = m-register to use with buffer, e.g., m2

       typical usage:
       --------------
       cdelay(i2, m2);

       internal operation:
       -------------------
       m2 = -1;  modify(i2, m2);    (i.e., backshift pointer i2)
}

.macro cdelay(%0, %1);

%1 = -1;  modify(%0, %1);             {backshift pointer}

.endmacro;

{=========================================================================}
{=========================================================================}

{dot.dsp - dot product of a DM with a PM circular buffer of length M+1.
 Junior DSP Lab - Rutgers ECE Dept - S. J. Orfanidis - Jan 1996.

       %0 = filter order M, i.e., length L = M+1
       %1 = pointer to filter taps buffer in PM, e.g., i4 (not modified)
       %2 = m-register to use with tap buffer, e.g., m4
       %3 = pointer to delay-line buffer in DM, e.g., i2 (not modified)
       %4 = m-register to use with delay buffer, e.g., m2
       
       result is returned in MR1;
       i2, i4 are not modified - they cycle around to their entry values

       typical usage:
       --------------
       dot(M, i4, m4, i2, m2);

       internal operation:
       -------------------
       m2 = 1; m4 = 1;
       mr = 0, mx0 = dm(i2, m2), my0 = pm(i4, m4);
       cntr = M;
       do loop until ce;
loop:     mr = mr + mx0 * my0 (ss), mx0 = dm(i2, m2), my0 = pm(i4, m4);
       mr = mr + mx0 * my0 (rnd);
       if mv sat mr;
}

.macro dot(%0, %1, %2, %3, %4);
.local loop;

       %2 = 1; %4 = 1;
       mr = 0, mx0 = dm(%3, %4), my0 = pm(%1, %2);
       cntr = %0;
       do loop until ce;
loop:     mr = mr + mx0 * my0 (ss), mx0 = dm(%3, %4), my0 = pm(%1, %2);
       mr = mr + mx0 * my0 (rnd);
       if mv sat mr;

.endmacro;

{=========================================================================}
{=========================================================================}

{cfir.dsp - direct-form FIR filter of order M using circular buffers.
 Junior DSP Lab - Rutgers ECE Dept - S. J. Orfanidis - Jan 1996.

 Based on cfir.c and cfir2.c of Introduction to Signal Processing.
 In book: y = cfir(M, h, w, &p, x);

       %0 = filter order M, so that filter length is L = M+1
       %1 = pointer to filter taps buffer in PM, e.g., i4
       %2 = m-register to use with tap buffer,   e.g., m4
       %3 = pointer to delay-line buffer in DM,  e.g., i2
       %4 = m-register to use with delay buffer, e.g., m2
       %5 = data register holding input, e.g.,
            ax0, ax1, ay0, ay1, ar, mx0, mx1, my0, my1, mr1, sr0, sr1
       
       the filter output is returned in MR1
       and the delay-line pointer i2 is updated, that is, backshifted

       typical usage:
       --------------
       cfir(M, i4, m4, i2, m2, mx1);

       internal operation:
       -------------------
       tapin(i2, m2, mx1);         put input from MX1 into tap-0
       dot(M, i4, m4, i2, m2);     compute dot-product output
       cdelay(i2, m2);             update delay line
}

.macro cfir(%0, %1, %2, %3, %4, %5);

       tapin(%3, %4, %5);          {read input sample into delay line}
       dot(%0, %1, %2, %3, %4);    {compute filter output into MR1}
       cdelay(%3, %4);             {update delay line}

.endmacro;