

Optimizing Big Number Arithmetic Without SSE - zvrba
http://accu.org/index.php/journals/1849

======
simias
IMHO the problem here is the capabilities of the optimizing compiler, not the
language itself. I would argue that it should be the compiler's job to figure
out these optimizations; otherwise you might as well use inline ASM or compiler
intrinsics, since it won't be portable anyway.

There are certain features of the C standard that force the compiler to
generate non-optimal code by default in certain cases (like the aliasing rules,
as any FORTRAN proponent will remind you). I don't think that's the case in
TFA.

So basically, blame the optimizer, not the language.

P.S.: I used this cached version since the site appears to be down:
[http://webcache.googleusercontent.com/search?q=cache:accu.or...](http://webcache.googleusercontent.com/search?q=cache:accu.org/index.php/journals/1849&strip=1)

~~~
mrich
You could argue that it is a fault of the C language that it does not expose
the carry flag of the CPU. This makes the most optimal implementation of e.g.
addition impossible without resorting to (inline) assembler.

Of course not every CPU has a carry flag...

~~~
simias
Except that exposing too much hardware state to the coder might give less room
for the compiler to optimize aggressively for all architectures.

What if the architecture doesn't have a carry flag? Then the compiler has to
generate one when needed.

What if the architecture has better ways to do the addition anyway? What if
the architecture is only 8-bit and even regular "int" adds take several
instructions anyway, so it makes no sense optimizing for 128-bit adds?

C doesn't even mandate what happens in case of an integer overflow, it has no
business exposing carry flags.

If I ever write a C tutorial I'll title it "How I learned to stop worrying and
love the compiler".

EDIT: of course in this case you can't really "let the compiler handle it" if
you care about bignum performance. The only solutions here would be:

a/ expose much more hardware state to the language, which as I've already
stated doesn't sound very convenient or C-ish to me,

b/ modify the optimizer to better handle bignum code or

c/ use a bignum library and possibly make it part of the standard.

I wouldn't be opposed to a standard <bignum.h>, I think it would make some
sense. In the meantime I just use gmp.h, it's faster than anything I could
come up with anyway, asm or otherwise. Bignum is hard.

~~~
zvrba
> What if the architecture doesn't have a carry? Then the compiler has to
> generate one when needed.

Adding two N-bit numbers produces an (N+1)-bit result, and C gives you no
direct means of accessing the full result. IMO, this is a language defect and
has little to do with HW support. [This pertains also to multiplication: NxN
bits -> 2N-bit result.]

If the hardware actually returns the full result, excellent; if not, the
compiler has to synthesize code to compute it, _if needed_. Whether it's
needed is inferable from the code that follows, e.g., the type of the variable
the result is assigned to.

IMO, inferring that only a partial result is used (e.g., the carry is
discarded) is much easier for the optimizer than inferring what some
instruction sequence is supposed to do. (E.g., that 4+ multiplications and a
few additions amount to a single 32x32=>64 multiply.)

------
erichocean
A relatively easy way for a C programmer to bridge the gap is to use something
like DynASM[0].

Alternatively, for SPMD-like problems, Intel's open source ISPC[1] compiler is
a pretty easy way to benefit from the SIMD hardware in your processor.

[0] [http://luajit.org/dynasm.html](http://luajit.org/dynasm.html)

[1] [http://ispc.github.io/](http://ispc.github.io/)

~~~
dman
ispc is a research project, so I would think twice before using it in
production.

------
nkurz
Article poses good questions, although the load time is terrible[1]. Here's
the intro while you are waiting for it:

    
    
      There is one thing which has puzzled me for a while, and 
      it is the performance of programs written in C when it 
      comes to big numbers. It may or may not help with the  
      decades-long ‘C vs Fortran’ performance debate, but let’s 
      concentrate on one single and reasonably well-defined 
      thing – big number arithmetic in C and see if it can be 
      improved.
    
      In fact, there are very few things which gain from being 
      rewritten in assembler (compared to C), but big number 
      arithmetic is one of them, with relatively little progress 
      in this direction over the years. Let’s take a look at 
      OpenSSL (a library which is among the most concerned about 
      big number performance: 99% of SSL connections use RSA 
      these days, and RSA_performance == Big_Number_Performance, 
      and RSA is notoriously sslloooowww).
    
      ...
    
      OpenSSL prefers to use assembler for big number 
      calculations. It was the case back in 1998, and it is 
      still the case now (last time I checked, the difference 
      between C and asm implementations was 2x, but it was long 
      ago, so things may have easily changed since). But why 
      should this be the case? Why with all the optimizations 
      compilers are doing now, should such a common and trivial 
      thing still need to be written in asm (which has its own 
      drawbacks – from the need to write it manually for each 
      and every platform, to sub-optimality of generic asm when  
      it comes to pipeline optimizations – and hand-rewriting 
      asm for each new CPU -march/-mtune is not realistic)? If 
      it can perform in asm but cannot perform in C, it means 
      that all hardware support is present, but the performance 
      is lost somewhere in between C developer and generated 
      binary code; in short – the compiler cannot produce good 
      code. This article will try to perform some analysis of 
      this phenomenon.
    

[1] Maybe they are using IPoAC? It sometimes exhibits high packet loss with
heavy rain, particularly on international links:
[http://en.wikipedia.org/wiki/IP_over_Avian_Carriers](http://en.wikipedia.org/wiki/IP_over_Avian_Carriers)

~~~
aortega
> But why should this be the case? Why with all the optimizations compilers
> are doing now, should such a common and trivial thing still need to be
> written in asm?

Yes, if the CPU vendor keeps coming up with all kinds of clowny instructions
like MMX/SSE2/SSE4/SSE XP/etc to accelerate things (but really because they
need new instructions so the ABI remains patented). Compilers and libraries
can't keep up with all the new instructions.

~~~
dbaupp
High performance libraries do keep up with the new instructions, e.g. GMP[1]
likes the new MULX[2] instruction a lot.

[1]: [https://gmplib.org/list-archives/gmp-devel/2013-August/003353.html](https://gmplib.org/list-archives/gmp-devel/2013-August/003353.html)

[2]: [http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html](http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html)

------
josteink
When hearing a claim such as "there is a substantial gap between the
capabilities of hardware and C", my first question is "what hardware"?

Last I checked, C was used on lots of hardware, much of which differs
substantially from the x86 architecture.

C aims to be portable. For it to be portable, its constructs must be too.

If there exists hardware which doesn't have a carry flag (as mentioned
elsewhere in this thread), exposing it to the C programmer will result in non-
portable code, or a need for (possibly inefficient) compiler-internal
workarounds.

Looking at it the other way around, namely upwards the abstraction layer, C
has a very limited capability to convey the _intent_ of code. Lots of hardware
has the capability to process things in parallel, and C doesn't have any way
to exploit that without explicit code to do so.

And as long as C is meant to be portable, that _has_ to be all right.

~~~
sanxiyn
Carry flag emulation may be "possibly inefficient" on architectures without
carry flag, but not exposing carry flag on architectures with carry flag is
definitely inefficient. I don't think it is the right tradeoff.

Consider that many architectures on which C can be used lack hardware
multipliers. This is totally not an argument against having multiplication in
the language.

------
fulafel
Current CPUs are largely C machines, with a few extensions past C's edges. It
was not always so; see e.g. P-machines, the AS/400, Lisp machines, even
current GPUs.

------
tempodox
The current page is shown because the Blocklayout Template Engine failed to
render the page, however this could be due to a problem not in BL itself but
in the template. BL has raised or has left uncaught the following exception:

Database Query Error

Description: ErrorNo: 2006, Message:Database error while executing: 'SELECT
inst.xar_id as bid, btypes.xar_type as type, btypes.xar_module as module,
inst.xar_name as name, inst.xar_title as title, inst.xar_content as content,
inst.xar_last_update as last_update, inst.xar_state as state,
group_inst.xar_position as position, bgroups.xar_id AS bgid, bgroups.xar_name
AS group_name, bgroups.xar_template AS group_bl_template, inst.xar_template AS
inst_bl_template, group_inst.xar_template AS group_inst_bl_template FROM
xar_block_group_instances group_inst LEFT JOIN xar_block_groups bgroups ON
group_inst.xar_group_id = bgroups.xar_id LEFT JOIN xar_block_instances inst ON
inst.xar_id = group_inst.xar_instance_id LEFT JOIN xar_block_types btypes ON
btypes.xar_id = inst.xar_type_id WHERE bgroups.xar_name = 'header' AND
inst.xar_state > 0 ORDER BY group_inst.xar_position ASC'; error description
is: 'MySQL server has gone away'.

Explanation: A database query could not be executed, either because the query
could not be understood or because it returned unexpected results.

Product: App - Modules

Component: Articles

------
dschiptsov
Yeah, yeah, pointers are evil, memory management is hard, cache locality is a
horrible mess, and lame, sequential spaghetti code does not automatically
scale via a "stupid" compiler, so we should all use Java and NodeJS VMs, which
eliminate the necessity to think, and even the capacity to do so. With VMs,
questions of optimal performance just never arise. :)

~~~
alexchamberlain
Except if you are writing the VM...

------
ye
Mirror:

[https://web.archive.org/web/20140208005209/http://accu.org/i...](https://web.archive.org/web/20140208005209/http://accu.org/index.php/journals/1849)

