Friday, August 28, 2015

Getting started with the STM32F7 Discovery board

The other week I got a brand new STM32F7 Discovery board, the fabled Cortex M7 had arrived and performance bliss was at hand ...

... except there were a few minor stumbling stones on the road to paradise ...

this is not an ad for ST, I did pay the full Mouser price.

  • ST programming tools for the ST-Link interface are not available for Mac OSX, and the board does not expose the JTAG/SWD of the Cortex M7, it is only available through the onboard ST-Link adapter chip. The GGC ARM Embedded toolchain was happy to generate code after importing some system defines and startup code from STM32Cube_FW_F7_V1.1.0 library and a bit of tweaking to Makefiles and linker scripts. It was possible to flash the binaries to the board using the USB file system interface, but then there is no debug connection, and it refused load code that was linked to use fast ITCM flash access.
  • With some some work it was possible to add the chip and flash configuration details to the free stlink utility and also tweak the stlink RAM based flash writing code to handle the 64 bit flash access so flash is correctly written, all running on Mac OSX.
  • The example code in the STM32Cube_FW_F7_V1.1.0 library does not by default use the external 25MHz crystal clock, only the onchip internal 16MHz oscillator. The PLL is also not configured by default, meaning the chip runs at 16MHz, and not the 196 or 216 Mhz that should be possible.
  • For optimal performance the linker settings should be configured to make the code use the ART accelerated ITCM flash access at 0x0020000 instead of the normal, slow, flash access at 0x0800000. When the linker was configured this way the ST onboard USB flash loader refused to load code into flash, but stlink utility worked.
  • The external SDRAM for the Discovery board has a different configuration than the one in the "system_stm32f7xx.c" supplied from ST. It seems ST has used the settings for another STM32F7 board and not been careful to make the changes for the F7 Discovery board.
Virtual USB com port was quite easy, it turns out that the USB subsystem is almsot identical to the the one on STMF4 chips.

After these tweaks I was able to run some simple tests where the chip executed between 200 and 300M instructions/sec. I think this can be a great chip for audio synthesis.

Next will be the big LCD display, but that must wait for when I have lots of free time again.

Monday, August 10, 2015

More USB and MIDI

Time goes fast, new and more powerful microprocessors are introduced, and the favourites of a year ago are fading and starts to collect dust in a draver filled with useful but not quite exciting boards. These days the Cortex M3 of the Maple board feels a bit old, the Teensy 3.1 is a great board and the Arduino type libraries and support is superb, but it has a bit limited memory and no hardware floating point.The boards I use mostly for audio generation at the moment are the Netduino plus 2, without the Netduino bootloader and .Net libraries, the STM32F4 Discovery and the XMC4500 Relax Kit Lite. All of these have Cortex-M4F processors, more than 128KB RAM and  1024KB of flash, they are all cheap and good value for money.

I need stable USB MIDI and after wrangling with the device libraries and development platforms, endless hierachies of folders and files that are usually targeted at Windows and often lacking USB MIDI, I decided to write yet another lightweight USB implementation. I know this might sound stupid and a waste of effort but having some time free due to vacation times and a bit of rainy summer I went ahead.

The code is written from scratch and uses only some MCU specific header files borrowed unchanged  from the manufaturer supplied code libraries. When time allows and I have done some more testing a few minimal example programs will be added.

I have uploaded the code to my Cortal Dendrites repository on Github:

Happy Coding :)

Friday, May 22, 2015

Computing fractional multipliers and divisors using continued fractions.

This blog post will discuss how to find the best rational approximation with bounded numerator or denominator to a a preferred frequency ratio. This will be a bit mathematical but actually nothing more than standard addition, subtraction, multiplication and division will be used. Some example code is given at the end.

Fractional baudrate generation

Some microprocessors uses fractional multipliers and divisors to generate accurate baud rates from system clocks that are not an integer multiple of the bitslice clock frequency. The bitslice frequency is the baudrate times an oversampling count.

\[ f_{bitslice} = oversample \cdot baudrate = ratio \cdot f_{sysclock} \tag{1} \]

An example is the USIC in the Infineon XMC1100 and XMC4500 family microprocessors. Here the clock for the bitslices is derived from the system clock, slightly simplified, as 

\[ f_{bitslice} = (DCQT+1) baudrate = \frac 1 {(PDIV+1)} \cdot \frac {STEP} {1024} \cdot f_{sysclock} \]

DCQT gives the oversampling count, typically set to 15. PDIV and STEP are register values controlling the frequency division. On the XMC1100 PDIV and STEP are 10 bit wide so they have values from 0 to 1023. Rearranging the above we see that we must find a good approximation

\[ \frac {STEP} {(PDIV+1)} \approx \frac {(DCQT+1) baudrate \cdot 1024} { f_{sysclock} } \tag{2} \]

with values of PDIV and STEP between 0 and 1023.

Example: 38400 baud from a 32MHz clock and 16 times oversampling

We enter the values in (2) to find

\( \frac {STEP} {(PDIV+1)} = \frac {16 \cdot 38400 \cdot 1024} { 32000000 } = \frac {629145600} { 32000000 }  = 19.6608 \) 

Working by hand we would now factor out common factors from numerator and denominator, but for computer implementations that is unnecessary extra work. As we shall see the algorithm works as well without that step.
Now we calculate the signed integer division with remainder using the rounded value 20 as quotient.

\( \frac {16 \cdot 38400 \cdot 1024} { 32000000 } = 20 - \frac {10854400} { 32000000 }  = 20 - \frac {1} {  \frac { 32000000 } {10854400}  } \tag{3} \) 

Rewrite the denominator in the last fraction by carrying out the division with remainder

\(  \frac { 32000000 } {10854400}  = 3 - \frac {563200}{ 10854400 } \approx  2.948 \approx 3  \) 

We enter this value 3 in the previous formula (3) to get the approximation

\[ \frac {16 \cdot 38400 \cdot 1024} { 32000000 } \approx 20 - \frac {1} { 3 } = \frac {59} {3} = 19.6667 \] 

This is good, after one iteration the error is less than 0.008 or about 300ppm.
We can improve this by using a better using a more precise value

\(\frac { 32000000 } {10854400} = 3 - \frac {563200}{ 10854400} = 3 - \frac 1 { \frac { 10854400 }{563200}}  = 3 - \frac 1 {19 + \frac {153600 }{563200}}  \approx 3 - \frac 1 { 19 } = \frac {56}{19}\)
Inserted into (3) this gives

\( \frac {16 \cdot 38400 \cdot 1024} { 32000000 } \approx 20 - \frac {1} {  \frac {56}{19} } = 20 - \frac {19} {56} =  \frac {1101} {56} = 19.6607  \)  

This approximation is better, error is about 3ppm, but the STEP value 1101 is to large for the 10 bit range, so we keep the previous approximation of 59/3.

Best rational approximation with continued fractions

In order to find a general method to calculate good rational approximations as in the previous example we use the theory of partial quotients, or convergents, of continued fraction expansions and best rational approximation theory. [TO DO add links ]   This is a well developed general theory for finding the best rational apprimations to a number with a given size of the denominator. There are recursive algoritms for calculating succusively better approximations.

So we are given a rational number p/q and we want to find succesive approximations \( a_n/b_n \) to p/q with smaller values for \( a_n \) and \(b_n \) than p and q. For the baud rate generations we want the best approximation where \( a_n \) and \(b_n \) fits in the respective fractional multiplier and divisor registers.

The formula

\[ \frac p q = c_0 + \frac {1}{c_1 + \frac {1}{c_2 + \ddots \frac {1}{c_n + \frac {1} {r_{n+1}}}}} \approx  c_0 + \frac {1}{c_1 + \frac {1}{c_2 + \ddots \frac {1}{c_n }}}  = \frac {a_{n+1}}{b_{n+1}} \]

For the first few values of n we get:

\[ a_{-1} =0,  b_{-1}=1 \]
\[ a_0 =1,  b_0=0 \]
\[  \frac {a_{1}}{b_{1}}  =  c_0 = \frac {0 + 1 \cdot c_0}{1 + 0 \cdot c_0}  = \frac {a_{-1} + a_0 \cdot c_0}{b_{-1} + b_0 \cdot c_0}   \]
\[ a_1 =c_0,  b_1=1 \]
\[  \frac {a_{2}}{b_{2}}  =  c_0 +  \frac {1}{c_1} =  \frac {c_0 \cdot c_1 + 1}{c_1} = \frac {a_{0} + a_1 \cdot c_1}{b_{0} + b_1 \cdot c_1}   \]
\[  \frac {a_{3}}{b_{3}}  =   \frac {a_{0} + a_1  ( c_1+ \frac 1 {c_2})}{b_{0} + b_1 ( c_1+ \frac 1 {c_2})} = \frac {(a_{0} + a_1  c_1 ) c_2+ a_1}{(b_{0} + b_1  c_1 ) {c_2}+  b_1} = \frac {a_1 + a_{2}{c_2}}{b_1+b_{2}{c_2}} \]
We now can see the recursion formula develop
\[  \frac {a_{4}}{b_{4}}  =   \frac {a_{1} + a_2  ( c_2+ \frac 1 {c_3})}{b_{1} + b_2 ( c_2+ \frac 1 {c_3})} = \frac {(a_{1} + a_2  c_2 ) c_3+ a_2}{(b_{1} + b_2  c_2 ) {c_3}+  b_2} = \frac {a_2 + a_{3}{c_3}}{b_2+b_{3}{c_3}} \] 

\[  \frac {a_{n+1}}{b_{n+1}}  =   \frac {a_{n-2} + a_{n-1}  ( c_{n-1}+ \frac 1 {c_n})}{b_{n-2} + b_{n-1} ( c_{n-1}+ \frac 1 {c_{n}})} = \frac {(a_{n-2} + a_{n-1}  c_{n-1} ) c_n+ a_{n-1}}{(b_{n-2} + b_{n-1}  c_{n-1} ) {c_n}+  b_{n-1}} = \frac {a_{n-1} + a_{n}{c_n}}{b_{n-1}+b_{n}{c_n}} \] 

The algorithm

We start with a rational number r = p/q.  We will generate a sequence of five values \(a_n, b_n, c_n, p_n\) and \(q_n\). The quotients \(a_n/b_n\) will be our succesive approximations to r. The values \(r_n\) are defined as \(p_n/q_n\) but they need not be explicitly calculated, the formula for \(r_n\) is used as the template for how \(p_n\) and \(q_n\) are updated.

\( r_0 = r  \)
\( p_0 = p,  q_0 = q \)
\( a_{-1} =0,  b_{-1}=1 \)
\( a_0 =1,  b_0=0 \)
Loop until q_n+1 is 0 or an or bn are to large
\( c_{n} = round(r_{n}) = round ( p_{n}/q_{n} ) \)
\( r_{n+1} = 1/(r_n-c_n) \)    This is done calculated terms of \(p_n\) and \(q_n\)
\( p_{n+1} = q_n \)
\( q_{n+1}=p_{n}-c_n \cdot q_n \)
\( a_{n+1}=a_{n-1}+a_{n} \cdot c_{n} \)
\( b_{n+1}=b_{n-1}+b_{n} \cdot c_{n} \)

The rounding operations can be done downwards, keeping all values positive, or towards the nearest integer improving precision but also introducing negative numers and extra complexity. Note that the \(c_n\) value is calculated before updating \(p_n\) and \(q_n\) and is not remembered and used in the next iteration but recalculated.

A skeleton C implementation

Following code lacks error handling and only checks limit for numerator, but it illustrates the algorithm. It does lack a few comments, but follows the algortithm structure closely.

void cfractr(int32_t p, int32_t q,int32_t alim, uint32_t * ares, uint32_t * bres) {
    int ap = 0;
    int a1 = 1;
    int bp = 1;
    int b1 = 0;
    int cn = a1, anext, bnext, pnext;
    while (1) {
        /* Signed rounded rational division */
        if ((q>0)&&(p>0)||(q<0)&&(p<0))
            cn = (p+q/2)/q;
            cn = (p-q/2)/q;
        /* Next value for partial quotients and remainder */
        anext = ap + cn*a1;
        bnext = bp + cn*b1;
        pnext = p-cn*q;
        /* Exact value, remainder is 0, break */
        if (pnext == 0) {
            a1 = anext;
            b1 = bnext;
        /* Numerator too large, break */
        if ((anext<-alim)||(anext>alim)) {
        /* Shift one step before next iteration */
        ap = a1;
        bp = b1;
        a1 = anext;
        b1 = bnext;
        p = q;
        q = pnext;
    if (a1<0){a1=-a1;b1=-b1;}
    *ares = a1;
    *bres = b1;