You could get rid of the dynamic loop by using a sparse kernel: keep the number of samples constant and only adjust the width of the kernel based on velocity. It'll be a bit grainy at large kernel sizes, but it should look reasonably good for motion blur and be pretty quick as well. If a downsampled (size/2) version of the texture is available, or if the texture happens to have mip levels, you could also sample the half-sized version to cover the same area with fewer texture samples, since every tex2D call expands to up to eight or so instructions. Obviously the downsampled texture would have to be supplied by the engine, so this can't be done by just tweaking the shader.
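To make the sparse kernel idea concrete, here is a rough CPU-side sketch in Python rather than actual shader code. The names `NUM_SAMPLES`, `sparse_blur_offsets`, `motion_blur` and the `sample` callback are all made up for illustration, and the tap count of 8 is just an assumed value. The point is only that the loop bound is a constant and velocity changes the spacing of the taps, not their number.

```python
NUM_SAMPLES = 8  # fixed tap count (assumed value, pick whatever the budget allows)


def sparse_blur_offsets(velocity, num_samples=NUM_SAMPLES):
    """Return num_samples UV offsets spread evenly along the velocity vector.

    The offsets are centered on the pixel, so they run from -velocity/2
    to +velocity/2. A larger velocity only stretches the spacing.
    """
    vx, vy = velocity
    offsets = []
    for i in range(num_samples):
        t = i / (num_samples - 1) - 0.5  # runs from -0.5 to 0.5
        offsets.append((vx * t, vy * t))
    return offsets


def motion_blur(sample, uv, velocity, num_samples=NUM_SAMPLES):
    """Average a fixed number of taps along the velocity vector.

    `sample(u, v)` is a hypothetical stand-in for a texture fetch.
    """
    u, v = uv
    total = 0.0
    for du, dv in sparse_blur_offsets(velocity, num_samples):
        total += sample(u + du, v + dv)
    return total / num_samples
```

In a shader the same structure would presumably let the compiler unroll the loop, since the trip count no longer depends on velocity. The graininess at high velocity comes from the gaps between taps growing with the kernel width.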
Also, to get a better effective sample rate, placing the sample positions at the corners of texels instead of their centers (i.e. offsetting s by 0.5/texWidth and t by 0.5/texHeight) should improve quality a bit, because bilinear filtering then gives you an exact average over four texels with every sample.
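Here is a small Python sketch of why the corner trick works. It is a software model of bilinear filtering, not real GPU code, and it assumes the convention that texel centers sit at (i + 0.5)/size in texture coordinates with clamp addressing. Under that convention a texel corner is half a texel away from a center. The function name `bilinear` and the tiny 2x2 test texture are made up for the example.

```python
import math


def bilinear(tex, u, v):
    """Bilinear sample of a 2D list `tex` at normalized coords (u, v).

    Assumes texel centers at (i + 0.5) / size and clamp-to-edge addressing.
    """
    h = len(tex)
    w = len(tex[0])
    # Convert to texel space so that integer values land on texel centers.
    x = u * w - 0.5
    y = v * h - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # horizontal blend weight
    fy = y - y0  # vertical blend weight

    def texel(ix, iy):
        ix = min(max(ix, 0), w - 1)
        iy = min(max(iy, 0), h - 1)
        return tex[iy][ix]

    top = texel(x0, y0) * (1 - fx) + texel(x0 + 1, y0) * fx
    bottom = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


tex = [[0.0, 4.0],
       [8.0, 12.0]]

center = bilinear(tex, 0.25, 0.25)  # center of texel (0, 0): returns 0.0
corner = bilinear(tex, 0.5, 0.5)    # corner shared by all four texels: returns 6.0
```

Sampling at the center returns just that one texel, while sampling at the shared corner returns the mean of all four (0, 4, 8 and 12), so each tap effectively covers four texels for the price of one fetch.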
Just my $0.02. I might try this later tonight to see what it looks like