On Fri, Jan 22, 2010 at 5:22 PM, Wormszer <worm...@...> wrote:
That's interesting, it kind of relates to my original question if the
compiler was able to apply SIMD operations to the loop.
When you disabled vectorization did it affect the active index case?
No, the active index case isn't vectorized by the compiler anyway.
Are those numbers taking into account the setup time to create either the
active index or the intervals?
No. In a SIMD shader machine I generally expect that creating the runstate
representation (whatever it may be) from the results of a conditional is going
to be a relatively small proportion of the total runtime.
The iterator idea crossed my mind too, but I wouldn't have expected it to
cause such a performance hit. I guess it prevents the compiler from
unrolling the loop?
Not unrolling, vectorizing - the way I wrote the iterator appears to prevent
the compiler vectorizing the loop using SSE.
I wonder if the way you use the iterator is having an effect, and whether the
compiler would do something different with a for(begin, end, ++)
implementation etc.
I don't know. I know nothing about how gcc's tree vectorizer works. If it's
enabled by a heuristic whenever it sees special "simple" uses of the for loop,
then any iterator abstraction is unlikely to work.
It looks like active index is the way to go,
If hardware (SSE) vectorization isn't going to be on the cards for most
operations, I think the active index method is looking like a winner.
Generally speaking it seems to have more reliable performance characteristics,
especially in the face of incoherence.
I wonder why it doesn't perform
as well on the full range, is it because of the indirection that the
compiler won't vectorize it? That the memory addresses may not be
consecutive?
Yes, I think that's the reason.
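
To make the difference concrete, here is a minimal sketch of the two loop
shapes being compared. The function names and the choice of a simple add are
assumptions for illustration only, not the code that was benchmarked. The
second loop reaches every element through active[k], so the compiler cannot
assume the addresses are consecutive.

```cpp
#include <cstddef>
#include <vector>

// Full-range loop: contiguous, unit-stride access over all n elements.
// This is the simple shape an auto-vectorizer usually recognises.
void addDense(float* out, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Active index loop: only the listed elements are touched, and each access
// goes through the indirection active[k].
void addActive(float* out, const float* a, const float* b,
               const std::vector<std::size_t>& active)
{
    for (std::size_t k = 0; k < active.size(); ++k)
    {
        const std::size_t i = active[k];
        out[i] = a[i] + b[i];
    }
}
```

Both loops do the same work per element. The only difference is how the
element index is obtained.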
If the two methods perform well under different conditions, is there enough
of a benefit to, say, implement both active intervals and active indices? Or
a hybrid: active index, and if the number of indices == N then it's all on
and could just do a loop without indirection and use a vectorized code path.
I considered something like this too. IMHO, making things this complex
requires that the SIMD state iteration should be abstracted, but an iterator
abstraction isn't appropriate in that case.
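
One possible reading of this, as a rough sketch only: abstract the loop itself
rather than the iterator, so the runstate can pick the code path. RunState and
forEachActive are hypothetical names, and the sketch assumes that active holds
distinct indices in [0, n). Whether the compiler actually vectorizes the dense
branch once op is inlined is untested here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical runstate: n SIMD elements, of which `active` lists the ones
// that are currently on (assumed distinct and in range).
struct RunState
{
    std::size_t n;
    std::vector<std::size_t> active;
};

// Call op(i) for every active element i, choosing the loop shape from the
// runstate instead of exposing an iterator to the caller.
template<typename Op>
void forEachActive(const RunState& rs, Op op)
{
    if (rs.active.size() == rs.n)
    {
        // Everything is on: plain counted loop, no indirection.
        for (std::size_t i = 0; i < rs.n; ++i)
            op(i);
    }
    else
    {
        // Partially on: go through the active index list.
        for (std::size_t k = 0; k < rs.active.size(); ++k)
            op(rs.active[k]);
    }
}
```

A shader operation would then be written once as the body of op, for example
forEachActive(rs, [&](std::size_t i) { out[i] = a[i] + b[i]; });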
Is there enough coherence from frame to frame, execution to execution,
that you could possibly score the run and use that method the next time?
Sort of like branch prediction, have some method to measure the coherence or
incoherence of the current run to predict the next, even occasionally.
That doesn't sound like it would help to me ;-) You need to modify the
runstate at every conditional branch anyway, so it's possible to analyse it for
coherence during modification, if necessary.
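
As an illustration of what that could look like, here is a sketch that builds
the new active index list from a conditional and counts contiguous runs in the
same pass. FilterResult, filterActive and the use of "number of runs" as the
coherence measure are all assumptions made for the example. It also assumes
the incoming index list is sorted in increasing order.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result of applying a conditional to the runstate.
struct FilterResult
{
    std::vector<std::size_t> active; // indices still on after the branch
    std::size_t numRuns;             // contiguous runs of indices in `active`
};

// Keep the active elements for which cond is true, counting runs on the fly.
// Few runs relative to active.size() indicates a coherent runstate.
FilterResult filterActive(const std::vector<std::size_t>& active,
                          const std::vector<bool>& cond)
{
    FilterResult r;
    r.numRuns = 0;
    for (std::size_t k = 0; k < active.size(); ++k)
    {
        const std::size_t i = active[k];
        if (!cond[i])
            continue;
        // A new run starts unless i directly follows the last kept index.
        if (r.active.empty() || r.active.back() + 1 != i)
            ++r.numRuns;
        r.active.push_back(i);
    }
    return r;
}
```

The extra cost is one comparison per surviving element, so the measurement
comes almost for free with the modification that has to happen anyway.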
~Chris.