I have put together a grouped list of estimated instruction costs, based on this Chinese reference.
The numbers may not be exact, but they mostly match my experience.
Some intuitions are:
- abs and saturate are free. (Is clamp in GLSL really not free? I doubt it.)
- Log, exp, sqrt are almost free! (That’s why Kernel Foveated Rendering is fast)
- Sin, cos are super fast!
- smoothstep is more expensive than expected.
- I would suggest a cheap replacement for Gaussians:
    // Cubic pulse centered at c with half-width w.
    // fabs is the C spelling; use abs() in HLSL/GLSL.
    float cubicPulse( float c, float w, float x )
    {
        x = fabs(x - c);
        if( x>w ) return 0.0;
        x /= w;
        return 1.0 - x*x*(3.0-2.0*x);
    }
Here is the full grouped list:
- Cost 0 (almost free)
  - abs(x), saturate(x)
- Cost 1
  - floor(x), ceil(x), round(x), frac(x), exp2(x), dot(a, b), min(a, b), max(a, b), sin(x), cos(x), sincos(x), sqrt(x), rsqrt(x)
- Cost 1.5
  - faceforward(n, i, ng)
- Cost 2
  - clamp(x, a, b), exp(x), log(x), log10(x), cross(a, b), step(a, x), lerp(a, b, f), length(v), distance(a, b)
- Cost 2.5
  - reflect(i, n)
- Cost 3
  - any(x), pow(x, y), sign(x), normalize(v)
- Cost 4
  - all(x), fmod(x, y), mul(m, pos), transpose(M)
- Cost 5 or greater
  - 7: smoothstep(min, max, x)
  - 10: acos(x)
  - 11: asin(x)
  - 16: atan(x)
  - 22: atan2(y, x)
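The cost of 7 for smoothstep is less surprising once the function is written out. The sketch below is plain C following the usual definition (rescale, saturate, then a cubic polynomial). `saturate_f` and `smoothstep_f` are stand-in names I chose so they do not clash with the shader intrinsics, and the operation counts in the comments are only a rough tally, since the real instruction count depends on the compiler and the hardware.

```c
/* Clamp x to [0, 1]; listed as free in the table above. */
static float saturate_f(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* smoothstep written out step by step. */
static float smoothstep_f(float lo, float hi, float x)
{
    /* Rescale x to [0, 1]: two subtractions and one division. */
    float t = saturate_f((x - lo) / (hi - lo));
    /* Cubic Hermite polynomial: three multiplications and one subtraction. */
    return t * t * (3.0f - 2.0f * t);
}
```

When the edges are constants, a compiler can fold `1 / (hi - lo)` ahead of time, so the cost in practice may well be lower than the worst case.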
One remaining question:
- How fast is texture sampling on modern GPUs?
  - One option is to measure it with NVIDIA ShaderPerf: https://developer.nvidia.com/nvidia-shaderperf
  - My guess is a cost of about 20.
Any further experiments and feedback are welcome.
Unfortunately, looking at the number of asm instructions for an operation can be misleading. Some common reasons:
– asm operates on float4 registers, but modern GPUs operate on each float individually (from the thread’s perspective)
– some operations run on a different part of the hardware
– texture sample cost is mostly about memory latency, so it can be very cheap OR very expensive
I’d suggest you read these two articles if you want a deeper look at the costs involved:
http://www.humus.name/Articles/Persson_LowLevelThinking.pdf
http://www.humus.name/Articles/Persson_LowlevelShaderOptimization.pdf
Thank you Ben for your thoughtful comments and references! I will take a look later…
As for “texture sample cost is mostly about memory latency, so it can be very cheap OR very expensive”: could you please give an example of what causes each of the extreme cases? Thank you so much!