Main Archive Specials Wiki | FAQ Links Submit Forum

Reasonably accurate/fastish tanh approximation

References : Posted by Fuzzpilz

Notes :
Fairly obvious, but maybe not obvious enough, since I've seen calls to tanh() in code snippets here.

It's this, basically:

tanh(x) = sinh(x)/cosh(x)
= (exp(x) - exp(-x))/(exp(x) + exp(-x))
= (exp(2x) - 1)/(exp(2x) + 1)

Combine this with a somewhat less accurate approximation for exp than usual (I use a third-order Taylor approximation below), and you're set. Useful for waveshaping.

Notes on the exp approximation:

It only works properly for input values above x, but since tanh is odd, that isn't a problem.

exp(x) = 1 + x + x^2/(2!) + x^3/(3!) + ...

Breaking the Taylor series off after the third term, I get

1 + x + x^2/2 + x^3/6.

I can save some multiplications by using

6 + x * (6 + x * (3 + x))

instead; a division by 6 becomes necessary, but is lumped into the additions in the tanh part:

(a/6 - 1)/(a/6 + 1) = (a - 6)/(a + 6).


I haven't looked at this in very great detail, but it's always <= the real tanh (>= for x<0), and the greatest deviation occurs at about +/- 1.46, where the result is ca. .95 times the correct value.

This is still faster than tanh if you use a better approximation for the exponential, even if you simply call exp.
There are probably additional ways of improving parts of this, and naturally if you're going to use it you'll want to figure out whether your particular application offers additional ways of simplifying it, but it's a good start.

Code :
/* single precision absolute value, a lot faster than fabsf() (if you use MSVC++ 6 Standard - others' implementations might be less slow) */
float sabs(float a)
int b=(*((int *)(&a)))&0x7FFFFFFF;
return *((float *)(&b));

/* approximates tanh(x/2) rather than tanh(x) - depending on how you're using this, fixing that could well be wasting a multiplication (though that isn't much, and it could be done with an integer addition in sabs instead) */
float tanh2(float x)
float a=sabs(x);
return ((x<0)?-1:1)*(a-6)/(a+6); /* instead of using <, you can also check directly whether the sign bit is set ((*((int *)(&x)))&0x80000000), but it's not really worth it */


Added on : 18/09/04 by fuzzpilz[ AT ]gmx[ DOT ]net
Comment :
Not sure why this didn't occur to me earlier, but you can easily save another two adds as follows:

return ((x<0)?-1:1)*a/(a+12);

Added on : 06/10/04 by kaleja[ AT ]estarcion[ DOT ]com
Comment :
You shouldn't need the sabs() on VC6 - you just need to add:

#pragma intrinsic( fabs )

before calling fabsf(), and it should go optimally fast.

Added on : 07/10/04 by Laurent de Soras
Comment :
You can optimise it a bit more by using the fact that tanh (x) = 1 - 2 / (exp (2*x) + 1)

Added on : 07/10/04 by fuzzpilz[ AT ]gmx[ DOT ]net
Comment :
AFAIK intrinsics aren't supported by VC6 Standard, but limited to Professional and Enterprise. Might be wrong, though, in which case I am a silly person. (no time to check now)

Added on : 23/03/05 by Christian[ AT ]savioursofsoul[ DOT ]de
Comment :
Delphi Code:

// approximates tanh(x/2) rather than tanh(x) - depending on how you're using
// this, fixing that could well be wasting a multiplication
function tanh2(x:single):Single;
var a : single;
if (x<0)
  then result:=-a/(a+24)
  else result:= a/(a+24);

Added on : 23/03/05 by Christian[ AT ]savioursofsoul[ DOT ]de
Comment :
Laurent de Soras wrote:
"You can optimise it a bit more by using the fact that tanh (x) = 1 - 2 / (exp (2*x) + 1)"

It's not faster, because you'll need 3 more cycles. The routine would then look like this:

function tanh2(x:single):Single;
var a : single;
if (x<0)
  then result:= (-1+24/a)
  else result:=  (1-24/a);

Added on : 08/05/05 by didid[ AT ]skynet[ DOT ]be
Comment :
I must have missed this one..
but why is the comparison needed?

a simpler version would be:


So in asm:

function Tanh2(x:Single):Single;
const c3 :Single=3;
      c6 :Single=6;
        FLD     x
        FLD     ST(0)
        FLD     c3
        FADD    ST(0),ST(1)
        FMUL    ST(0),ST(1)
        FADD    c6
        FMULP   ST(2),ST(0)
        FADD    c12
        FDIVP   ST(1),ST(0)

..but almost all the CPU is wasted by the division anyway

Added on : 08/05/05 by didid[ AT ]skynet[ DOT ]be
Comment :
wait.. has anyone tested this function?

Here's a test plot:


Green=the approximation suggested in this thread

Blue=another approximation that does:

function TanH3(x:Single):Single;
Result:=x - x*x*x/3 + 2*x*x*x*x*x/15;

If I didn't do anything wrong, the green one is VERY far from TanH. Blue is both closer & computationally more efficient.

But ok, this plot is for a normalized 0..1. When you go above, the blue like goes crazy.
But now considering that -1..1 is what matters the most for what we do, the input could still be clipped.

Added on : 08/05/05 by didid[ AT ]skynet[ DOT ]be
Comment :
forget all this :)

it's all embarrassing bullshit and I obviously need some sleep :)

Added on : 08/05/05 by didid[ AT ]skynet[ DOT ]be
Comment :
Ok ignore my above crap that I can't delete, I swear that this one does work :)

First I hadn't seen that this function was assuming x*2, so my graph was scaled by 2..

Second, the other algo (blue line) is still not to be neglected (because no FDIV) when the input is in the -1..1 range, and it does work.

Third, I'm suggesting here a version without the comparison/branching, but still, the CPU difference is neglectable because of the FDIV.

Here it is (this one does NOT assume a premultiplied x)..

plain code:

function Tanh2(x:Single):Single;
var   a,b:Single;


function Tanh22(x:Single):Single;
const c3  :Single=3;
      c6  :Single=6;
      c12 :Single=12;
        FLD     x
        FMUL    Mul2
        FLD     ST(0)
        FABS                 // a
        FLD     c3
        FADD    ST(0),ST(1)
        FMUL    ST(0),ST(1)
        FADD    c6           // b
        FMUL    ST(2),ST(0)  // x*b
        FMULP   ST(1),ST(0)  // a*b
        FADD    c12
        FDIVP   ST(1),ST(0)

Added on : 09/05/05 by Christian[ AT ]savioursofsoul[ DOT ]de
Comment :
Any suggestions about improving the 3DNow Divide-Operation??? I simply hate my code...

procedure Transistor2_3DNow(pinp,pout : PSingle; Samples:Integer;Scalar:Single);
const ng   : Array [0..1] of Integer = ($7FFFFFFF,$7FFFFFFF);
      pg   : Array [0..1] of Integer = ($80000000,$80000000);
      c2   : Array [0..1] of Single = (2,2);
      c3   : Array [0..1] of Single = (3,3);
      c6   : Array [0..1] of Single = (6,6);
      c12  : Array [0..1] of Single = (12,12);
      c24  : Array [0..1] of Single = (24,24);
shr ecx,1
movd       mm1, Scalar.Single
punpckldq  mm1, mm1
movq       mm0, c2
pfmul      mm0, mm1

movq       mm3, c3
movq       mm4, c6
movq       mm5, c12
movq       mm6, c24

movq       mm1, [eax] //mm1=input
pfmul      mm1, mm0   //mm1=a
movq       mm2, mm1   //mm1=a,   mm2=a

pand       mm2, ng    //mm1=a,   mm2=|a|

pfadd      mm3, c3    //mm1=a,   mm2=|a|, mm3=|a|+3
pfmul      mm3, mm2   //mm1=a,   mm2=|a|, mm3=|a|*(|a|+3)
pfadd      mm3, c6    //mm1=a,   mm2=|a|, mm3=6+|a|*(3+|a|)
pfmul      mm3, mm2   //mm1=a,   mm2=|a|, mm3=|a|*(6+|a|*(3+|a|))
pfadd      mm3, c12   //mm1=a,   mm2=|a|, mm3=b=12+|a|*(6+|a|*(3+|a|))
pfmul      mm1, mm3   //mm1=a*b, mm2=|a|, mm3=a*(12+|a|*(6+|a|*(3+|a|)))
pfmul      mm2, mm3   //mm1=a*b, mm2=|a|*b
pfadd      mm2, c24   //mm1=a*b, mm2=|a|*b+24

movq       mm3, mm2
pfrcp      mm4, mm3
punpckldq  mm3, mm3
pfrcpit1   mm3, mm4
pfrcpit2   mm3, mm4
movq       mm4, mm2
punpckhdq  mm4, mm4
pfrcp      mm5, mm4
pfrcpit1   mm4, mm5
pfrcpit2   mm4, mm5
punpckldq  mm4, mm5
pfmul      mm1, mm4
movq       [edx],mm1
add        eax,8
add        edx,8
loop    @Start

Added on : 09/05/05 by didid[ AT ]skynet[ DOT ]be
Comment :
mmh why the loop? You can't process more than 2 Tanh in parallel in this filter, can you?
What CPU gain did you get btw?

Anyway, sucks that 3Dnow doesn't provide division.. SSE does, though.. DIVPS (or DIVPD to get a double accuracy in this case) would work here. Only problem is that on an AMD I usually get better performances out of 3DNow than SSE/SSE2.

Added on : 10/05/05 by Christian[ AT ]savioursofsoul[ DOT ]de
Comment :
The loop works perfectly well, but it's of course for Tanh() processing of a whole block instead of inside the moog filter.

The thing, that 3DNow doesn't provide division really sucks. Anyway, this way i will save a small amount of performance, but it's not huge. But i believe one can optimize the 12 lines of division further more. Also data prefetching might help a little. Or restructuring, because on AMD the order does matter!

I'll SSE/SSE2 the code tonight. I think SSE gives a good performance boost, but SSE2 precisition would be needed, if the thing is inside the moog filter (IIR Filter coefficients should allways stay 64bit to avoid digital artifacts).



Added on : 11/05/05 by Christian[ AT ]savioursofsoul[ DOT ]de
Comment :
Here's the Analog Devices "Sharc" DSP translation of the tanh function (inline processing of register f0):

f11 = 2.;
f0  = f0 * f11;
f11 = abs f0;
f3  = 3.;    
f12 = f11+f3;
f12 = f11*f2;
f3  = 6.;
f12 = f12+f3;
f0  = f0*f12;
f12 = f11*f12;
f7  = 12.;
f12 = f12+f3;
f11 = 2.;
f0 = recips f12, f7=f0;
f7=f0*f7, f0=f11-f12;
f7=f0*f7, f0=f11-f12;
f7=f0*f7, f0=f11-f12;    

it can be optimized further more, but hey...

Added on : 25/02/06 by Gene
Comment :
tanh(x/2)~ x/(abs(x)+3/(2+x*x))


Added on : 25/02/06 by Gene
Comment :
tanh(x/2) ~ x/(abs(x)+2/(2.12-1.44*abs(x)+x*x))

Maximum normalized difference 0.0063 from real tanh (x/2) - good enough now.

Add your own comment
Comments are displayed in fixed width, no HTML code allowed!


Are you human?

Site created and maintained by Bram
Graphic design by line.out | Server sponsered by fxpansion