> Not unrolling, vectorizing - the way I wrote the iterator appears to prevent
> the compiler vectorizing the loop using SSE.
I guess I was thinking of vectorization as a kind of unrolling, though not in the strict sense:
the loop gets expanded by 4, or however wide the vector operation is, which reduces the total number of iterations.
For the add case I was assuming it would process elements i, i+1, i+2 and i+3 together.
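As a rough sketch of that picture, here is the add written as plain scalar C++ with the loop stepped in groups of four. The function name addStripMined and the fixed width of 4 are assumptions for illustration only; this is not what gcc emits, and a real SSE build would do each group of four adds as a single packed instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Conceptual picture of a 4-wide vectorized add, written as scalar C++.
// Hypothetical helper, not taken from the test code under discussion.
void addStripMined(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c)
{
    assert(a.size() == b.size() && a.size() == c.size());
    const std::size_t n = a.size();
    const std::size_t width = 4;             // assumed: 4 floats per SSE register
    const std::size_t nVec = n - n % width;  // largest multiple of width <= n

    // Main loop: n/4 iterations, each handling i, i+1, i+2, i+3.
    for (std::size_t i = 0; i < nVec; i += width) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    // Scalar remainder loop for the last n % 4 elements.
    for (std::size_t i = nVec; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The grouping only works because the elements are adjacent in memory, which is why an indirect access through an index array gets in the way.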
> If hardware (SSE) vectorization isn't going to be on the cards for most
> operations, I think the active index method is looking like a winner.
> Generally speaking it seems to have more reliable performance characteristics,
> especially in the face of incoherence.
As for this and the rest, I don't know enough about the system yet or how it actually works. I was basing it more on your test code and some of the earlier discussion on SIMD shaders.
After looking at your numbers more: if the only case that performed better was all-on, because of the vectorization, then monitoring and predicting wouldn't help,
because it would just be a simple check that the number of active indices equals N. From your example I was thinking you might have two code paths.
So for your add it would be something like:
    if (nActive == nTotal) { // vectorized path
        for (int j = 0; j < nActive; ++j) {
            c[j] = a[j] + b[j];
        }
    } else { // non-vectorized path
        for (int j = 0; j < nActive; ++j) {
            int i = activeIndex[j];
            c[i] = a[i] + b[i];
        }
    }
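A self-contained version of the same two-path idea might look like the sketch below. The function name addActive and the use of std::vector are assumptions for illustration; it also assumes the active index holds unique, in-range indices, so that a full-length index means every element is on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: c[i] = a[i] + b[i] for the active elements only.
// Assumes activeIndex contains unique, in-range indices.
void addActive(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& c, const std::vector<int>& activeIndex)
{
    assert(a.size() == b.size() && a.size() == c.size());
    const std::size_t nTotal = a.size();
    const std::size_t nActive = activeIndex.size();

    if (nActive == nTotal) {
        // All on: direct loop with no indirection, which the compiler
        // has a chance of vectorizing.
        for (std::size_t j = 0; j < nTotal; ++j)
            c[j] = a[j] + b[j];
    } else {
        // Partially on: go through the index array.
        for (std::size_t j = 0; j < nActive; ++j) {
            const int i = activeIndex[j];
            assert(i >= 0 && static_cast<std::size_t>(i) < nTotal);
            c[i] = a[i] + b[i];
        }
    }
}
```

Elements that are not listed in activeIndex are left untouched in c.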
In a case where the operator couldn't be vectorized, you would only need the one path.
But things are probably not that simple, and I'm sure there is a lot more going on that I am missing.
Jeremy
On Fri, Jan 22, 2010 at 10:32 PM, Chris Foster
<chri...@...> wrote:
On Fri, Jan 22, 2010 at 5:22 PM, Wormszer <
worm...@...> wrote:
> That's interesting, it kind of relates to my original question if the
> compiler was able to apply SIMD operations to the loop.
> When you disabled vectorization did it affect the active index case?
No, the active index case isn't vectorized by the compiler anyway.
> Are those numbers taking into account the setup time to create either the
> active index or the intervals?
No. In a SIMD shader machine I generally expect that creating the runstate
representation (whatever it may be) from the results of a conditional is going
to be a relatively small proportion of the total runtime.
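
For the active index representation, that setup step could be sketched roughly as follows. The function name buildActiveIndex and the use of std::vector<bool> for the per-element condition results are assumptions for illustration, not the code that was timed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: collect the indices of all elements for which the
// conditional evaluated to true. One linear pass over the condition results.
std::vector<int> buildActiveIndex(const std::vector<bool>& cond)
{
    std::vector<int> activeIndex;
    activeIndex.reserve(cond.size());
    for (std::size_t i = 0; i < cond.size(); ++i) {
        if (cond[i])
            activeIndex.push_back(static_cast<int>(i));
    }
    assert(activeIndex.size() <= cond.size());
    return activeIndex;
}
```

The pass touches each element once, which is consistent with it being cheap next to the shader operations that then run over the active set.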
> The iterator idea crossed my mind too, but I wouldn't have expected it to
> have such a performance hit. I guess it prevents the compiler from
> unrolling the loop?
Not unrolling, vectorizing - the way I wrote the iterator appears to prevent
the compiler vectorizing the loop using SSE.
> I wonder if the way you use the iterator is having an effect; with a
> for(begin, end, ++) implementation, would the compiler do something
> different?
I don't know. I know nothing about how gcc's tree vectorizer works. If it's
enabled by a heuristic whenever it sees special "simple" uses of the for loop,
then any iterator abstraction is unlikely to work.
> It looks like active index is the way to go,
If hardware (SSE) vectorization isn't going to be on the cards for most
operations, I think the active index method is looking like a winner.
Generally speaking it seems to have more reliable performance characteristics,
especially in the face of incoherence.
> I wonder why it doesn't perform
> as well on the full range, is it because of the indirection that the
> compiler won't vectorize it? That the memory addresses may not be
> consecutive?
Yes, I think that's the reason.
> If the two methods perform well under different conditions, is there enough
> of a benefit to, say, implement both active intervals and active indices? Or a
> hybrid: active index, and if the number of indices == N, then it's all on and
> you could just do a loop without indirection and use a vectorized code path.
I considered something like this too. IMHO, making things this complex
requires that the SIMD state iteration should be abstracted, but an iterator
abstraction isn't appropriate in that case.
> Is there enough coherence from frame to frame, or execution to execution,
> that you could possibly score the run and use that method the next time?
> Sort of like branch prediction, have some method to measure the coherence or
> incoherence of the current run to predict the next, even occasionally.
That doesn't sound like it would help to me ;-) You need to modify the
runstate at every conditional branch anyway, so it's possible to analyse it for
coherence during modification, if necessary.
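
As one possible illustration of analysing during modification, the sketch below narrows an active index at a branch and counts contiguous runs in the same pass. The RunState struct, the numRuns field, the function name refineWithCoherence, and the choice of run count as the coherence measure are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical runstate: active indices in increasing order, plus the number
// of maximal runs of consecutive indices (fewer runs = more coherent).
struct RunState {
    std::vector<int> activeIndex;
    std::size_t numRuns = 0;
};

// Keep only the parent's active elements for which cond is true, and count
// runs while doing so, so no separate analysis pass is needed.
RunState refineWithCoherence(const RunState& parent,
                             const std::vector<bool>& cond)
{
    RunState out;
    out.activeIndex.reserve(parent.activeIndex.size());
    for (int i : parent.activeIndex) {
        assert(i >= 0 && static_cast<std::size_t>(i) < cond.size());
        if (!cond[i])
            continue;
        // A new run starts when this index does not follow the previous one.
        if (out.activeIndex.empty() || out.activeIndex.back() + 1 != i)
            ++out.numRuns;
        out.activeIndex.push_back(i);
    }
    return out;
}
```

A result with numRuns == 1 and a full-length index would correspond to the all-on case discussed above.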