run flags vs active intervals

Chris Foster <chri...@...>

Hi all,

I've been looking through the OSL source a little, and I'm interested to see
that you're using runflags for the SIMD state. I know that's a really
conventional solution, but there's an alternative representation which I reckon
is significantly faster, and I wondered if you considered it.

Imagine a SIMD runstate as follows

index: [ 0 1 2 3 4 5 6 7 8 9 ]

flags: [ 0 0 1 1 1 1 0 0 1 0 ]

An alternative to the flags is to represent this state as a list of active
intervals. As a set of active start/stop pairs, the state looks like

active = [ 2 6 8 9 ]


state: [ 0 0 1 1 1 1 0 0 1 0 ]
^ v ^ v
2 6 8 9

The advantage of doing this is that the inner SIMD loops become tighter since
there's one less test. Instead of

for (int i = 0; i < len; ++i)
if (state[i])

we'd have

for (int j = 0; j < nActiveTimes2; j+=2)
for (int i = active[j]; i < active[j+1]; ++i)

and the inner loop is now completely coherent.

I can see you have the beginnings of this idea in the current OSL code, since
you pass beginpoint and endpoint along with the runflags everywhere. However,
why not take it to its logical conclusion?

Given that you already have beginpoint and endpoint, the particularly big win
here is when most of the flags are turned on. If do_something() is a simple
arithmetic operation (eg, float addition) the difference between the two
formulations can be a factor of two in speed.

I'm attaching some basic test code which I whipped up. Timings on my core2 duo
with gcc -O3 look like:

All flags on (completely coherent):

run flags: 1.310s
active intervals: 0.690s

Random flags on (completely incoherent):

run flags: 5.440s
active intervals: 3.310s

Alternate flags on (maximum number of active intervals -> worst case):

run flags: 1.500s
active intervals: 2.150s