Prime field elliptic curve cryptosystems (ECC) are gaining popularity especially in embedded systems, because of their smaller need in processing power and memory than RSA or ElGamal. Modular inverses are used extensively during point addition, doubling and multiplication (See more details in [2]). 20….30% overall speedup is possible, just with the use of a better algorithm.
An elliptic curve E over GF(p) (the field of residues modulo the prime p) is defined as the set of points (x,y) (together with the point at infinity O) satisfying the reduced Weierstraß equation:
E : f^{ }(X,Y) =^{∆} Y^{ }^{2} – X^{ }^{3} – aX – b 0 mod p.
In elliptic curve cryptosystems the data to be encrypted is represented by a point P on a chosen curve. Encryption by the key k is performed by computing Q = P + P + … + P = k∙P. Its security is based on the hardness of computing the discrete logarithm in groups. This operation, called scalar multiplication (the additive notation for exponentiation), is usually computed with the doubleandadd method (the adaptation of the wellknown squareandmultiply algorithm to elliptic curves, usually with signed digit recoding of the exponent [15]). When the resulting point is not the point at infinity O, the addition of points P = (x_{P},y_{P}) and Q = (x_{Q},y_{Q}) leads to the resulting point R = (x_{R},y_{R}) through the following computation:
x_{R} = λ^{2} – x_{P} – x_{Q} mod p
y_{R} = λ∙(x_{P} – x_{R}) –y_{P} mod p
where
λ = (y_{P}–y_{Q}) / (x_{P}–x_{Q}) mod p if P =/ Q
λ = (3x^{2}_{P}+a) / (2y_{P}) mod p if P = Q
Here the divisions in the equations for λ are shorthand notations for multiplications with the modular inverse of the denominator. P = (x_{P},y_{P}) is called the affine representation of the elliptic curve point, but it is also possible to represent points in other coordinate systems, where the field divisions (multiplications with modular inverses) are traded to a larger number of field additions and multiplications. These other point representations are advantageous when computing the modular inverse is much slower than a modular multiplication. In [11] the reader can find discussions about point representations and the corresponding costs of elliptic curve operations.
2.1Multiplications
There are situations where the modular inverse has to be, or it is better calculated without any multiplication operations. These include

If the available multiplier hardware is slow.

If there is no multiplier circuit in the hardware at all. For example, on computational platforms where long parallel adders perform multiplications by repeated shiftadd operations. (See [13] for fast adder architectures.)

For RSA key generation in cryptographic processors, where the multiplier circuit is used in the background for the exponentiations of the (MillerRabin) primality test. [10]

In prime field elliptic or hyper elliptic curve cryptosystems, where the inversion can be performed parallel to other calculations involving multiplications.
Of course, there are also computational platforms, where multiplications are better used for modular inverse calculations. These include workstations with very fast or multiple multiplier engines (could be three: ALU, Floating point multiplier and Multimedia Extension Module).
In digit serial arithmetic engines there is usually a digitbydigit multiplier circuit (for 8…128 bit operands), which can be utilized for calculating modular inverses. This multiplier is the slowest circuit component; other parts of the circuit can operate at much higher clock frequency. Appropriate hardware designs, with faster nonmultiplicative operations, can defeat the speed advantage of those modular inverse algorithms, which use multiplications. This way faster and less expensive hardware cores can be designed.
This kind of hardware architecture is present in many modern microprocessors, like the Intel Pentium processors. They have 1 clock cycle base time for a 32bit integer add or subtract instruction (discounting operand fetch and other overhead), and they can sometimes be paired with other instructions for concurrent execution. A 32bit multiply takes 10 cycles (a divide takes 41 cycles), and neither can be paired.
The algorithms considered in this paper process the bits or digits of their long operands sequentially, so in a single cycle fetching more neighboring digits (words) into fast registers allows the use of slower, cheaper RAM, or pipeline registers.
We will use only add/subtract, compare and shift operations. With trivial hardware enhancements the shift operations can be done “on the fly” when the operands are loaded for additions or subtractions. This kind of parallelism is customarily provided by DSP chips, and it results in a close to twofold speedup of the shifting xGCD based modular inverse algorithms.
Shift operations could be implemented with manipulating pointers to the bits of a number. At a subsequent addition/subtraction the hardware can provide the parameter with the corresponding offset, so arbitrary long shifts take only a constant number of operations with this offsetload hardware support. (See [16].) Even in traditional computers these pointer manipulating shift operations save time, allowing multiple shift operations to be combined into a longer one.
Share with your friends: 