3.1.3Justification
The algorithm starts with U = m, V = a, R = 0, S = 1. In the course of the algorithm U and V are decreased, keeping GCD(U,V) = GCD(m,a) true. The algorithm reduces U and V until V = 0 and U = GCD(m,a): If one of U or V is even, it can be replaced by its half, since GCD(m,a) is odd. If both are odd, the larger one can be replaced by the even U−V, which then can be decreased by halving, leading eventually to 0. The binary length of the larger of U and V is reduced by at least one bit, guaranteeing that the procedure terminates in at most a+m iterations.
At termination of the algorithm V = 0 otherwise a length reduction was still possible. U = GCD(U,0) = GCD(m,a). Furthermore, the calculations maintain the following two congruencies:
U ≡ Ra mod m, V ≡ Sa mod m. (1)
Having an odd modulus m, at the step halving U we have two cases. When R is even: U_{/}2 ≡ (R_{/}2)·a mod m, and when R is odd: U/2 ≡ ((R+m)_{/}2)·a mod m. The algorithm assigns these to U and R. Similarly for V and S, and with their new values (1) remains true.
The difference of the two congruencies in (1) gives U−V ≡ (R−S)·a mod m, which ensures that at the subtraction steps (1) remains true after updating the corresponding variables: U or V ← U−V, R or S ← R−S. Choosing +m or −m, as discussed above, guarantees that R and S does not grow larger than 2m, and immediately reduced to R, S < m, so at the end we can just add m if necessary to ensure 0 < R < m. If U = 1 = GCD(m,a) we get 1 ≡ Ra mod m, and R is of the right magnitude, so R = a^{−1} mod m. []
3.1.4PlusMinus: Algorithm RS+−
There is a very simple modification often used for the rightshift algorithm [6]: for the odd U and V check, if U+V has 2 trailing zero bits, otherwise we know that U−V does. In the former case, if U+V is of the same length as the larger of them, the shift operation reduces the length by 2 bits from this larger length, otherwise by only one bit (as before with the rigid subtraction steps). It means that the length reduction is sometimes improved, so the number of iterations decreases.
Unfortunately, this reduction is not large, only 15% (half of the time the reduction was by at least 2 bits, anyway, and longer shifts are not affected either), but it comes almost for free. Furthermore, R and S need more halving steps, and these get a little more expensive (at least one of the halving steps needs an addition of m), so the RS+− algorithm is not faster than RS1.
3.1.5
U m; V a;
R 0; S 1;
Q = m mod 4;
while (V_{0} = 0) { V V/2;
if(S_{0} = 0) S S/2;
elseif(S^{ }>^{ }m) S (Sm)/2;
else S (S+m)/2;}
Loop { // U, V odd
if (U > V) {
if (U_{1} = V_{1})
U U  V; R R  S;
else
U U + V; R R + S;
U U/4; T R mod_{ }4;
if (T = 0) R R/4;
elseif (T = 2) R (R+2m)/4;
elseif (T = Q) R (Rm)/4;
else R (R+m)/4;
while (U_{0} = 0) { U U/2;
if (R_{0} = 0) R R/2;
elseif(R^{ }>^{ }m) R (Rm)/2;
else R (R+m)/2;}
else {
if (U_{1} = V_{1})
V V  U; S S  R;
else
V V + U; S S + R;
if (V = 0) break;
V V/4; T S mod_{ }4;
if (T = 0) S S/4;
elseif (T = 2) S (S+2m)/4;
elseif (T = Q) S (Sm)/4;
else S (S+m)/4;
while (V_{0} = 0) { V V/2;
if (S_{0} = 0) S S/2;
elseif(S^{ }>^{ }m) S (Sm)/2;
else S (S+m)/2;}
}
if (U > 1) return 0; //no inverse
if (R < 0) R R + m;
return R; //a^{1} mod m

Figure 2. Double plusminus Rightshift binary algorithm
 Double Plusminus: Algorithm RS2+−
The plusminus reduction can be applied also to R and S (Figure 2). In the course of the algorithm they get halved, too. If one of them happens to be odd, m is added or subtracted to make them even before the halving. The plusminus trick on them ensures that the result has at least 2 trailing 0bits. It provides a speedup, because most of the time we had exactly two divisions by 2 (shift right by two), and no more than one addition/subtraction of m is now necessary.
The variables R and S get almost immediately of the same length as m, because, when they are odd, m is added to them to allow halving without remainder. We can delay these addhalving steps, by doubling the other variable instead. When R should be halved we double S, and vice versa. Of course, a powerof2 spurious factor is introduced to the computed GCD, but keeping track of the exponent a final correction step will fix R by the appropriate number of halving or addhalving steps. (This technique is similar to the Montgomery inverse computation published in [19] and sped up for computers in [18], but the correction steps differ.) It provides an acceleration of the algorithm by 24…38% over RS1, due to the following:

R and S now increase gradually, so their average length is only half as it was in RS1.

The final halving steps are performed only with R. The variable S need not be fixed, being only an internal temporary variable.

At the final halving steps more short shifts can be combined to longer shifts, because they are not confined by the amount of shifts performed on U and V in the course of the algorithm.
Note1: R and S are almost always of different lengths, and so their difference is not longer than the longer of R and S. Consequently, their lengths don't increase faster than what the shifts cause.
Note2: it does not pay to check, if R or S is even, in the hope that some halving steps could be performed until the involved R or S becomes odd, and so speeding up the final correction, because they are already odd in the beginning (easily proved by induction).
Share with your friends: 