Re: run flags vs active intervals

Wormszer <worm...@...>

That's interesting, it kind of relates to my original question if the compiler was able to apply SIMD operations to the loop.
When you disabled vectorization did it effect the active index case?

Are those numbers taking into account the setup time to create either the active index or the intervals? Or is it basically the same for each method?

The iterator idea crossed my mind too but I too wouldn't of expected it to have such a performance hit either. I guess it prevents the compiler from unrolling the loop?
I wonder if the way you use the iterator is having an effect, where a for(begin, end, ++) implementation etc, if the compiler would do something different.

It looks like active index is the way to go, i wonder why it doesn't perform as well on the full range, is it because of the indirection that the compiler won't vectorize it? That the memory addresses may not be consecutive?

If the two methods perform well under different conditions is there enough of a benefit to say implement both, active intervals/indexs? Or a hybrid,  Active index, and if the # of index's == N, then its all on and could just do a loop without indirection and use a vectorized code path.

Is there enough coherence between frame to frame, execution to execution, that you could possibly score the run and use that method the next time?
Sort of like branch prediction, have some method to measure the coherence or incoherence of the current run to predict the next, even occasionally.


On Thu, Jan 21, 2010 at 8:36 PM, Chris Foster <chri...@...> wrote:
On Fri, Jan 22, 2010 at 4:56 AM, Larry Gritz <l...@...> wrote:
> I want to point out that Chris F tested out the "all on", "random on", and
> "alternating on/off" cases.  There's one more case that may be important,
> which is a few isolated "on" points with big "off" gaps in between -- and in
> that case, I expect Chris F's solution to perform even better (compared to
> what we have now) than the other cases.

Right, here's the initialization code for some sparse on-states:

   // Four isolated flags turned on.
   std::fill((Runflag*)r, r+len, (Runflag)Runflag_Off);
   r[0] = r[50] = r[100] = r[150] = Runflag_On;

results for this sparse case:

run flags:        1.710s
active intervals: 0.100s

Of course in this case, the run flags completely fail to captialize on the
fact that most of the shading elements are turned off, so the active intervals
formulation thrashes it.  Out of curiosity, I've also implemented the direct
indexing array Chris K suggested (code attached, see the function
addIndexed() ):

active index:     0.050s

as expected, blazingly fast here!

Here's the rest of the benchmarks redone with the active indexing method added:

All flags on (completely coherent):

run flags:        1.310s
active intervals: 0.690s
active index:     1.330s

Random flags on (completely incoherent):

run flags:        5.440s
active intervals: 3.310s
active index:     0.760s

Alternate flags on (maximum number of active intervals):

run flags:        1.500s
active intervals: 2.150s
active index:     0.710s

The results are quite interesting.  They suggest that active indexing is likely
to be faster for highly incoherent data, but that active intervals is the clear
winnier when all your flags are on.

The random flags benchmark is particularly interesting to me because the active
intervals formulation does a lot worse than the active index in this case.

As for the implementation, you can potentially have the ability to change
between any of these run state implementations if you introduce an iterator
abstraction.  The tricky thing seems to be making sure the abstraction is
getting optimized as efficiently as the plain loops.

Here's my crack at an active intervals iterator class, but alas, the benchmarks
show that this gives a time of 1.170s for the all-on case compared to 0.68s for
the loop-based implementation.

class RunStateIter
       const int* m_intervals; ///< active intervals
       const int* m_intervalsEnd; ///< end of active intervals
       int m_currEnd;          ///< end of current interval
       int m_idx;              ///< current index
       RunStateIter(const int* intervals, int nIntervals)
           : m_intervals(intervals),
           m_intervalsEnd(intervals + 2*nIntervals),
           m_currEnd(nIntervals ? intervals[1] : 0),
           m_idx(nIntervals ? intervals[0] : 0)
       { }

       RunStateIter& operator++()
           if (m_idx >= m_currEnd) {
               m_intervals += 2;
               if (m_intervals < m_intervalsEnd) {
                   m_idx = m_intervals[0];
                   m_currEnd = m_intervals[1];
           return *this;

       bool valid() { return m_idx < m_currEnd; }

       int operator*() { return m_idx; }

Now why is this?  The above is essentially the loop-based code but unravelled
into an iterator.  I had to look at the assembly code to find out, and it turns
out that the compiler is optimizing addIntervals() using hardware SIMD!  (I spy
lots of movlps and an addps in there.)

Ack!  I hadn't expected that.  I was imagining that the efficiency gains in the
all-on case came from improving branch prediciton or some such.  Oops :-)
Using the flag -fno-tree-vectorize causes performance of the active intervals
code in the all-on case to devolve to 1.33s, just the same as the runflags

So, this changes a few things, because (1) I guess there's not that many
operations which the compiler can produce hardware SIMD code for, so the
efficiency gains I've just shown may evaporate if user-defined types come into
play and (2) the compiler (g++) seems to have trouble producing hardware SIMD
code when an iterator abstraction is involved.  That's a pity because using an
iterator would let you reimplement this stuff once and decide what the iterator
implementation should be later, based on real benchmarks.

Hum, so suddenly the way forward doesn't seem quite so clear anymore.  The
active index method starts to look very attractive if we discount hardware


You received this message because you are subscribed to the Google Groups "OSL Developers" group.
To post to this group, send email to osl...@....
To unsubscribe from this group, send email to osl...@....
For more options, visit this group at

Join to automatically receive all group messages.