The simulation code was written in C, developed in MS Visual Studio 6. It is available at: http://www.hars.us/SW/ModInv.c. GMP Version 4.1.2, the GNU multiple precision arithmetic library [9], was used for the long integer operations and for verifying the results. It is linked as an MS Windows DLL, also available at http://www.hars.us/SW/gmpdll.zip.
We executed 1 million calls of each of the many variants of the modular inverse algorithms, with 14 different operand lengths of 16…1024-bit random inputs, so the experimental complexity results are expected to be accurate within 0.1…0.3% (Central Limit Theorem) at every operand length. The performed operations and their costs were counted separately for the different kinds of operations. The table at the end of the paper contains the binary costs of the additions and shifts the corresponding modular inverse algorithms performed, the number of iterations, and the number of shifts of the most frequent lengths. (Multiple shifts are combined together.) The computed curves fit the data with less than 1% error at any operand length.
The right-shift algorithms are the slowest, because they halve two auxiliary variables (R, S), and if these happen to be odd, m is added or subtracted first to make them even for the halving. Theoretical arguments and our computational experiments both showed that they are too slow with digit-serial arithmetic. They were included in the discussion mainly because surprisingly many deployed systems use some variant of the right-shift algorithm, although others are much better.
The addition steps are not needed in the left-shift or in the shifting Euclidean algorithms. In all three groups of algorithms the length of U and V decreases bit by bit in each iteration, and in the left-shift and shifting Euclidean algorithms the length of R and S increases steadily from 1. In the right-shift case they very soon become as long as m, except in the delayed-halving variant. On average, the changing lengths roughly halve the work on those variables. Also, the necessary additions of m in the original right-shift algorithms prevent aggregation of the shift operations on R and S. In the other algorithms (including the delayed-halving right-shift algorithm), on the other hand, we can first determine by how many bits we have to shift altogether in that phase. In the left-shift algorithms, depending on the relative magnitude of u and v, we need only one or two shifts by multiple bits; in the shifting Euclidean algorithm only one. This shift aggregation saves work when the shifts are longer than the most common lengths of 1 or 2.
On the other hand, the optimum shift lengths in the left-shift and shifting Euclidean algorithms are only estimated from the MS bits, so they are sometimes wrong, while in the right-shift algorithms only the LS bits play a role, so the optimum shift lengths can always be found exactly. Accordingly, the right-shift algorithms perform slightly fewer iterations (by 8.6…10%), but the large savings in additions in the other algorithms offset this advantage.
4.1 Software Running Time Comparisons
We did not measure execution times of SW implementations, for the following reasons:

The results are very much dependent on the characteristics of the hardware platforms (word length, instruction timings, available parallel instructions, length and function of the instruction pipeline, processor vs. memory speed, cache size and speed, number of levels of cache memory, page fault behavior, etc.).

The results also depend on the operating system (multitasking, background applications, virtual/paging memory handling, etc.).

The results are also dependent on the code, the programming language and the compiler. For example, GMP [9] uses hand-optimized assembler macros, and any other SW written in a higher-level language is necessarily at a disadvantage, e.g., when handling carries.
Earlier publications did report running time measurements; in [5], for example, Jebelean gave software execution times of GCD algorithms on a DEC computer of RISC architecture. Our measurements on a 3 GHz Intel Pentium PC running Windows XP gave drastically different speed ratios. This large uncertainty is why we decided to count the number of specific operations and sum up their time consumption as a function of the operand lengths, instead of performing the much easier running time measurements. This way the actual SW running time can be well estimated on many different computational platforms.
4.2 Notes on the Simulation Results
The total number of the different UV shifts equals the number of iterations, since there is exactly one combined shift in each iteration.

In the left-shift algorithms the sum of the RS shifts is larger than the number of iterations, because some shifts may cause the relationship between u and v to change, and in that case there are 2 shifts in one iteration.

In [14] evidence is cited that the binary right-shift GCD algorithm performs A·log2 m iterations, with A = 1.0185… The RS1 algorithm performs the same number of iterations as the binary right-shift GCD algorithm. Our experiments gave a very similar (only 0.2% smaller) result: A' = 0.7045 / log 2 = 1.0164…
In the table at the end of the paper we listed the coefficients of the dominant terms of the best-fit polynomials to the time consumption of the algorithms, in 3 typical computational models:
1. Shifts execute in a constant number of clock cycles
Algorithm LS3 is the fastest (0.6662 n²), followed by SE3 (0.6750 n²), with only a 1.3% lag. The best right-shift algorithm is RSDH+−, which is 1.66 times slower (1.1086 n²).
2. Shifts are 4 times faster than add/subtracts
Algorithm SE3 is the fastest (0.8114 n²), followed by LS3 (0.9043 n²), within 14%. The best right-shift algorithm (RSDH+−) is 1.71 times slower (1.3858 n²).
3. Shifts and add/subtracts take the same time
Again, algorithm SE3 is the fastest (1.2176 n²), followed by SE (1.3904 n²), within 14%. The best right-shift algorithm (RSDH+−) is 2.37 times slower (2.8804 n²).
Interestingly, the Plus-Minus algorithm RS+−, which only ensures that U or V is reduced by at least 2 bits, performs fewer iterations, but the overall running time is not improved. When R and S are also handled this way, the running time does improve. This shows that speeding up the (R,S) halving steps is more important than speeding up the (U,V) reduction steps, because the latter operate on numbers of diminishing length, while the (R,S) halving works mostly on more costly, full-length numbers.