Use platform-aware fast rounding for FloatNearest::to_u8/to_u16

Introduce fast_round_f32, which delegates to the hardware rounding
instruction (roundss on x86 with SSE4.1, frintn on aarch64) and
falls back elsewhere to the mantissa-snapping trick
((x + 2^23) - 2^23), avoiding the costly libm roundf call.
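
For reference, the portable fallback could look roughly like the sketch
below. This is an illustration under stated assumptions, not the code in
this commit: the names snap_round_f32 and to_u8_nearest are hypothetical,
and the sign handling via copysign is an assumed extension so that
negative inputs also work. Note that the trick rounds ties to even
(as frintn does), whereas roundf rounds ties away from zero, so results
can differ on exact .5 inputs.

```rust
/// Hypothetical sketch of the mantissa-snapping fallback (not the commit's code).
/// Rounds to the nearest integer, ties to even, under the default
/// round-to-nearest floating-point mode.
fn snap_round_f32(x: f32) -> f32 {
    // 2^23: from this magnitude on, every f32 is already an integer.
    const MAGIC: f32 = 8_388_608.0;
    if x.abs() >= MAGIC {
        // Already integral (or infinite); the trick is not valid here.
        return x;
    }
    // Assumption: give MAGIC the sign of x so negative inputs snap correctly.
    let m = MAGIC.copysign(x);
    // Adding m pushes the fraction bits out of the 24-bit significand,
    // so the hardware rounds them away; subtracting m restores the scale.
    // NaN propagates through both operations unchanged.
    (x + m) - m
}

/// Hypothetical conversion helper: `as` saturates in Rust, so values
/// outside 0..=255 clamp and NaN becomes 0.
fn to_u8_nearest(x: f32) -> u8 {
    snap_round_f32(x) as u8
}
```

With these definitions, to_u8_nearest(254.6) yields 255 and
snap_round_f32(2.5) yields 2.0 rather than the 3.0 that roundf returns.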