fix: filter pushdown for nested fields
In #5295, we accidentally broke nested filter pushdown. The issue is
that FileSource::try_pushdown_filters appears to be meant to evaluate
against the whole file schema, rather than any projected schema. As an
example, the GitHub Archive benchmark dataset has the following query,
which should trivially push down and be pruned, executing in about
30ms:
```sql
SELECT COUNT(*) FROM events WHERE payload.ref = 'refs/head/main'
```
However, after this change, pushdown of this field failed, pushing
query time up by roughly 100x. The root cause is that the old logic
attempted to apply the file schema to the source_expr directly.
Concretely, for the gharchive query, the whole expression is something
like:
```text
BinaryExpr {
    lhs: GetField {
        source_expr: Column { name: "payload", index: 0 },
        field_expr: Literal { value: "ref" },
    },
    rhs: Literal { value: "refs/head/main" },
    operator: Eq,
}
```
The issue is that column index 0 is only valid against the projected
schema; in the full file schema, payload generally sits at a different
index, as the sketch below illustrates.
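A minimal standalone sketch of the mismatch, assuming arrow-schema
types and made-up field names (the real gharchive layout may differ):
```rust
use arrow_schema::{DataType, Field, Fields, Schema};

fn main() {
    // Hypothetical file schema: "payload" is the third field on disk.
    let file_schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("type", DataType::Utf8, false),
        Field::new("payload", DataType::Struct(Fields::empty()), true),
    ]);
    // The query projects only "payload", so the planner builds
    // Column { name: "payload", index: 0 } against this schema.
    let projected = file_schema.project(&[2]).unwrap();
    assert_eq!(projected.field(0).name(), "payload");

    // Resolving that same index 0 against the *file* schema lands on a
    // completely different field, which is why the pushdown check
    // rejected the filter.
    assert_eq!(file_schema.field(0).name(), "id");
}
```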
Instead, we need to recursively verify that the source_expr is a valid
chain of Column and GetField expressions that resolves against the
file schema.
Note that we were already doing this when checking whether a
standalone Column expression can be pushed down:
```rust
} else if let Some(col) = expr.downcast_ref::<df_expr::Column>() {
    schema
        .field_with_name(col.name())
        .ok()
        .is_some_and(|field| supported_data_types(field.data_type()))
```
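Extending that same shape recursively gives roughly the following.
This is a sketch rather than the final implementation: the expr type,
the GetField accessors (source_expr(), as_any()), and the enclosing
function are assumptions based on the snippet above.
```rust
use std::any::Any;

// Sketch only: `df_expr::GetField::source_expr()` and `as_any()` are
// assumed accessors; `supported_data_types` is the helper from the
// snippet above.
fn resolves_against_file_schema(expr: &dyn Any, schema: &Schema) -> bool {
    if let Some(col) = expr.downcast_ref::<df_expr::Column>() {
        // Base case: resolve a bare column by *name* against the file
        // schema, never by its projection-relative index.
        schema
            .field_with_name(col.name())
            .ok()
            .is_some_and(|field| supported_data_types(field.data_type()))
    } else if let Some(get_field) = expr.downcast_ref::<df_expr::GetField>() {
        // Recursive case: a GetField is pushable only if its source is
        // itself a valid Column / GetField chain over the file schema.
        // (A fuller check would also confirm that the field literal
        // names a child of the resolved struct type.)
        resolves_against_file_schema(get_field.source_expr().as_any(), schema)
    } else {
        false
    }
}
```
Resolving by name at the base of the chain sidesteps the
projection-relative index entirely, so the same check works whichever
schema the expression was originally planned against.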
Signed-off-by: Andrew Duffy <andrew@a10y.dev>