On Fri, Jan 22, 2010 at 4:56 AM, Larry Gritz <l...@...> wrote:
> I want to point out that Chris F tested out the "all on", "random on", and
> "alternating on/off" cases. There's one more case that may be important,
> which is a few isolated "on" points with big "off" gaps in between -- and in
> that case, I expect Chris F's solution to perform even better (compared to
> what we have now) than the other cases.
Right, here's the initialization code for some sparse on-states:
// Four isolated flags turned on, with big "off" gaps in between.
std::fill(r, r + len, Runflag_Off);
r[0] = r[50] = r[100] = r[150] = Runflag_On;
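For comparison, the equivalent input to the active intervals method is just
four length-one intervals, using the same half-open [begin,end) pair encoding
as the iterator class further down:

// Active intervals equivalent: one [begin,end) pair per isolated point.
int intervals[] = { 0,1,  50,51,  100,101,  150,151 };
int nIntervals = 4;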
Results for this sparse case:
run flags: 1.710s
active intervals: 0.100s
Of course, in this case the run flags completely fail to capitalize on the
fact that most of the shading elements are turned off, so the active intervals
formulation thrashes them. Out of curiosity, I've also implemented the direct
indexing array Chris K suggested (code attached; see the function
addIndexed()):
active index: 0.050s
As expected, blazingly fast here!
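For anyone reading along without the attachment, addIndexed() boils down to
something like this (names and signature simplified; the index array just
lists the positions of the "on" elements directly):

// Direct indexing: do the work only at the listed active points.
void addIndexed(float* a, const float* b,
                const int* indices, int nIndices)
{
    for (int i = 0; i < nIndices; ++i)
        a[indices[i]] += b[indices[i]];
}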
Here's the rest of the benchmarks redone with the active indexing method added:
All flags on (completely coherent):
run flags: 1.310s
active intervals: 0.690s
active index: 1.330s
Random flags on (completely incoherent):
run flags: 5.440s
active intervals: 3.310s
active index: 0.760s
Alternate flags on (maximum number of active intervals):
run flags: 1.500s
active intervals: 2.150s
active index: 0.710s
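(For completeness, the flag setups for these cases are the obvious ones,
along the lines of

// All on:
std::fill(r, r + len, Runflag_On);
// Random on/off:
for (int i = 0; i < len; ++i)
    r[i] = (rand() & 1) ? Runflag_On : Runflag_Off;
// Alternating on/off:
for (int i = 0; i < len; ++i)
    r[i] = (i & 1) ? Runflag_On : Runflag_Off;

with the interval and index arrays in each case built to match the flags.)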
The results are quite interesting. They suggest that active indexing is likely
to be faster for highly incoherent data, but that active intervals is the clear
winner when all your flags are on.
The random flags benchmark is particularly interesting to me because the active
intervals formulation does a lot worse than the active index in this case.
As for the implementation, introducing an iterator abstraction would let you
swap between any of these run state representations. The tricky thing seems to
be making sure the abstraction gets optimized as efficiently as the plain
loops.
Here's my crack at an active intervals iterator class. Alas, the benchmarks
show that it gives a time of 1.170s for the all-on case, compared to 0.69s for
the loop-based implementation.
class RunStateIter
{
    private:
        const int* m_intervals;     ///< active intervals
        const int* m_intervalsEnd;  ///< end of active intervals
        int m_currEnd;              ///< end of current interval
        int m_idx;                  ///< current index

    public:
        RunStateIter(const int* intervals, int nIntervals)
            : m_intervals(intervals),
            m_intervalsEnd(intervals + 2*nIntervals),
            m_currEnd(nIntervals ? intervals[1] : 0),
            m_idx(nIntervals ? intervals[0] : 0)
        { }

        /// Advance to the next active index, hopping to the next
        /// interval when the current one runs out.
        RunStateIter& operator++()
        {
            ++m_idx;
            if (m_idx >= m_currEnd) {
                m_intervals += 2;
                if (m_intervals < m_intervalsEnd) {
                    m_idx = m_intervals[0];
                    m_currEnd = m_intervals[1];
                }
            }
            return *this;
        }

        /// True while there are active indices left to visit.
        bool valid() const { return m_idx < m_currEnd; }

        /// The current active index.
        int operator*() const { return m_idx; }
};
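For reference, the iterator version of the benchmark loop is then simply (a
sketch, with a and b the same float arrays the loop-based versions operate
on):

// Iterator-based equivalent of the addIntervals() loop.
void addIter(float* a, const float* b,
             const int* intervals, int nIntervals)
{
    for (RunStateIter it(intervals, nIntervals); it.valid(); ++it)
        a[*it] += b[*it];
}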
Now why is this? The above is essentially the loop-based code but unravelled
into an iterator. I had to look at the assembly code to find out, and it turns
out that the compiler is optimizing addIntervals() using hardware SIMD! (I spy
lots of movlps and an addps in there.)
Ack! I hadn't expected that. I was imagining that the efficiency gains in the
all-on case came from improved branch prediction or some such. Oops :-)
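To see why the vectorizer gets a foothold, note that addIntervals() boils down
to a dense inner loop over each [begin,end) range, something like this
(simplified from the attached code):

// Loop-based active intervals: the inner loop runs over a contiguous,
// unit-stride range, which g++'s tree vectorizer can turn into packed
// SSE adds.
void addIntervals(float* a, const float* b,
                  const int* intervals, int nIntervals)
{
    for (int j = 0; j < nIntervals; ++j) {
        const int begin = intervals[2*j], end = intervals[2*j + 1];
        for (int i = begin; i < end; ++i)
            a[i] += b[i];
    }
}

In the all-on case the representation is a single long interval, so the whole
job becomes one big vectorizable loop.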
Using the flag -fno-tree-vectorize causes performance of the active intervals
code in the all-on case to devolve to 1.33s, just the same as the runflags
code.
So this changes a few things, because (1) I guess there aren't that many
operations for which the compiler can produce hardware SIMD code, so the
efficiency gains I've just shown may evaporate once user-defined types come
into play; and (2) the compiler (g++) seems to have trouble producing hardware
SIMD code when an iterator abstraction is involved. That's a pity, because
using an iterator would let you write this stuff once and decide what the
iterator implementation should be later, based on real benchmarks.
Hum, so suddenly the way forward doesn't seem quite so clear anymore. The
active index method starts to look very attractive if we discount hardware
vectorization.
~Chris.