fix: teach zstd to include correct validity when decompressing utf8
The zstd encoding stores only the non-null values. When decompressing integers, it iterates over the
set indices of the validity to "spread" the non-null values to their correct positions. When
decompressing utf8, no such spreading is done. As a result, the validity's length (which is the
correct length of the array) does not match the length of the VarBinView's values (which are just
the non-null values). Unfortunately, zstd uses `VarBinViewArray::new_unchecked`, so the validity's
length is never verified to equal the array's length.
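
To make the mismatch concrete, here is a minimal sketch in plain Rust. It is not the encoding's actual code: `spread` is a hypothetical helper, and the validity is modeled as a `&[bool]` rather than a real validity type. It mirrors the scatter step described above for integers, which the utf8 path lacked.

```rust
/// Hypothetical helper: scatter densely stored non-null values to the
/// positions marked valid, leaving `None` at null positions.
/// Returns `None` if the number of valid slots and values disagree.
fn spread<T: Clone>(validity: &[bool], non_null: &[T]) -> Option<Vec<Option<T>>> {
    let valid_count = validity.iter().filter(|v| **v).count();
    if valid_count != non_null.len() {
        return None;
    }
    // Consume the dense values in order, one per valid slot.
    let mut values = non_null.iter().cloned();
    Some(
        validity
            .iter()
            .map(|&is_valid| if is_valid { values.next() } else { None })
            .collect(),
    )
}
```

With a validity of length 3 and two non-null strings, the spread result has length 3. Skipping this step leaves a values buffer of length 2 paired with a validity of length 3, which is the inconsistency described above.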
I implemented the simplest fix I could think of: wrap the VarBin array in a Sparse array whose
indices are the valid indices.
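
The idea behind the wrapper can be sketched as follows. This is only an illustration in plain Rust, not the library's Sparse array API: `SparseSketch`, `from_validity`, and `get` are hypothetical names, and the validity is again modeled as a `&[bool]`.

```rust
/// Hypothetical stand-in for a sparse array: `len` logical slots, with
/// `values[k]` stored at position `indices[k]` and null everywhere else.
struct SparseSketch<T> {
    len: usize,
    indices: Vec<usize>, // sorted positions of the valid (non-null) slots
    values: Vec<T>,      // dense non-null values, as decompressed
}

impl<T> SparseSketch<T> {
    /// Build from a validity mask and the dense non-null values.
    fn from_validity(validity: &[bool], values: Vec<T>) -> Self {
        let indices: Vec<usize> = validity
            .iter()
            .enumerate()
            .filter_map(|(i, &valid)| valid.then_some(i))
            .collect();
        assert_eq!(indices.len(), values.len(), "one value per valid slot");
        Self { len: validity.len(), indices, values }
    }

    /// Look up a logical position; `None` means null (or out of bounds).
    fn get(&self, i: usize) -> Option<&T> {
        if i >= self.len {
            return None;
        }
        self.indices.binary_search(&i).ok().map(|k| &self.values[k])
    }
}
```

The dense values are left untouched, and the logical length comes from the validity, so the two lengths agree by construction.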
---
The test fails this way on develop:
```
running 1 test
test test::test_zstd_decompress_var_bin_view ... FAILED
failures:
---- test::test_zstd_decompress_var_bin_view stdout ----
thread 'test::test_zstd_decompress_var_bin_view' panicked at encodings/zstd/src/test.rs:181:5:
assertion `left == right` failed
left: Scalar { dtype: Utf8(Nullable), value: ScalarValue(BufferString(BufferString { string: "baz" })) }
right: Scalar { dtype: Utf8(NonNullable), value: ScalarValue(BufferString(BufferString { string: "Lorem ipsum dolor sit amet" })) }
```
Signed-off-by: Daniel King <dan@spiraldb.com>
230e78b
4 months ago
by danking
taplo fmt
Signed-off-by: Daniel King <dan@spiraldb.com>