On some computational platforms speed increase can be achieved with delayed full update of the variables. See, for example [4], [5] or [7]. It means working with MS and/or LS digits only, as long as we can determine the necessary reduction steps, and fix the rest of the digits only when more precision is needed. Speedup is achieved by the reduced number of data fetch, and combined update operations on the middle digits. Unfortunately, the resulting algorithms are much more complex, less suitable for direct hardware implementations and the combined operations involve multiplications, what we wanted to avoid. In our computational model data loadstore operations are free, so having fewer of them doesn’t provide any speed advantage.
5.2Hybrid Algorithms
The right shift and the shifting Euclidean algorithms operate on the opposite ends of the numbers. At least when the modulus is odd, the reduction steps could be intermixed: check on a few bits on the appropriate end of the intermediate values which algorithm is expected to reduce the lengths the most, and perform one step of it. However, the right shift algorithms are so much slower, that the corresponding reduction is almost never expected to be faster than the shifting Euclidean reduction. This way we could not achieve any significant speedup, but the algorithms became complicated and convoluted.
5.3Modular Division
If the initialization of S ← 1 is replaced by S ← d, we get da^{−1} as the result, as described in [22] for polynomials. In case the modular inverse is only needed once, and it is multiplied by another number, we could save that multiplication, like in elliptic curve cryptography. If the inverse is reused many times, like at signed digit exponentiation ([15]), this trick does not improve performance.
We start with a full length S (=d) instead of length 1, so (S,R) do not gradually increase from length 1, but start at length n. Further steps are necessary to prevent them to grow too large. These more than double the total work updating (S,R) (but not (U,V)) at the leftshift and shifting Euclidean algorithms, all together 50…100% increase. The right shift algorithms do not change much, so modular divisions significantly reduce their performance lag. In general, it only pays doing divisions this way, when the underlying modular inverse algorithm is much faster than two modular multiplications (making a modular division faster than 3 modular multiplications).
6.References 
D. E. Knuth, “The Art of Computer Programming”, volume 2, “Seminumerical Algorithms”, 3rd edition, AddisonWesley, 1998.
http://wwwcsfaculty.stanford.edu/~knuth/taocp.html

D.R. Hankerson, A.J. Menezes, S.A. Vanstone, “Guide to Elliptic Curve Cryptography”, Springer, 2004.

T. Jebelean, “A Generalization of the Binary GCD Algorithm”, ISSAC 93, pp. 111116. Technical report version available
ftp://ftp.risc.unilinz.ac.at/pub/techreports/1993/9301.ps.gz

T. Jebelean, “A DoubleDigit LehmerEuclid Algorithm for Finding the GCD of Long Integers”, Journal of Symbolic Computation, volume 19, 1995, pp. 145157. Technical report version also available ftp://ftp.risc.unilinz.ac.at/pub/techreports/1992/9269.ps.gz

T. Jebelean, “Comparing several GCD algorithms” In ARITH11: IEEE Symposium on Computer Arithmetic (Windsor, Canada, June). IEEE, New York, 180185.

R. P. Brent and H. T. Kung, “Systolic VLSI arrays for lineartime GCD computation”, in V. Anceau, E.J. Aas (eds.), VLSI’83. 1983 Elsevier (North Holland), pp 145154.

K. Weber, “The accelerated integer GCD algorithm”, ACM Transactions on Mathematical Software, volume 21, number 1, March 1995, pp. 111122.

A. Schönhage, V. Strassen, “Schnelle Multiplikation großer Zahlen”, Computing 7, 1971, pp. 281292.

GNU Multiple Precision Arithmetic Library manual
http://www.swox.com/gmp/gmpman4.1.2.pdf

A. Menezes, P. van Oorschot, and S. Vanstone, “Handbook of Applied Cryptography”, CRC Press, 1996.

H. Cohen, A. Miyaji, and T. Ono, “Efficient elliptic curve exponentiation using mixed coordinates”. In K. Ohta and D. Pei, editors, Advances in Cryptology  ASIACRYPT 98, Lecture Notes in Computer Science, No. 1514, pages 5165. Springer, Berlin, Germany, 1998. http://citeseer.ist.psu.edu/article/cohen98efficient.html

R. Lórencz, “New Algorithm for Classical Modular Inverse”, CHES 2002, LNCS 2523, pp 5770. SpringerVerlag, 2003.

M. D. Ergovac, T. Lang, “Digital Arithmetic” Chapter 2. Morgan Kaufmann Publishers, 2004.

B. Vallée, “Complete Analysis of the Binary GCD” 1998. http://citeseer.ist.psu.edu/79809.html

J. Jedwab and C. J. Mitchell. “Minimum weight modified signeddigit representations and fast exponentiation”. Electronics Letters, 25(17):11711172, 17. August 1989.

L. Hars, “Long Modular Multiplication for Cryptographic Applications”. http://eprint.iacr.org/2004/198/ CHES 2004, (LNCS 3156, pp 4461. SpringerVerlag, 2004.)

L. Hars, “Fast Truncated Multiplication and its Applications in Cryptography”, (CHES 2005), Edinburgh.

E. Savaş, Ç. K. Koç: “The Montgomery Modular Inverse – Revisited”, IEEE Transactions on Computers, 49(7), pp. 763766, July 2000

B.S. Kaliski Jr. “The Montgomery inverse and its applications”, IEEE Transactions on Computers, 44(8) pp. 1064–1065, August 1995.

P.L. Montgomery. “Modular multiplication without trial division”, Mathematics of Computation, 44(170) pp. 519–521, April 1985.

M. Joye and P. Paillier, “GCDfree Algorithms for Computing Modular Inverses”. Cryptographic Hardware and Embedded Systems  CHES 2003, SpringerVerlag, 2003, LNCS 2779, pp. 243–253.

S. C. Shantz. “From Euclid's GCD to Montgomery Multiplication to the Great Divide”. Technical Report TR200195, Sun Microsystems Laboratories, 2001.

R. Schroeppel, H. Orman, and S. O’Malley, "Fast Key Exchange with Elliptic Curve Systems," Tech. Report 95−03, Department of Computer Science, The University of Arizona, Tucson.

J. Stein, "Computational problems associated with Racah algebra", Journal of Computational Physics, 1, 1967.
Share with your friends: 