PARQUET-34: implement Size() filter for repeated columns #3098

clairemcginty · 2024-12-05T20:04:51Z

Rationale for this change

This PR continues the work outlined in #1452. It implements a size() predicate for filtering on the number of elements in repeated fields:

FilterPredicate hasThreeElements = size(intColumn("my_list_field"), Operators.Size.Operator.EQ, 3)

What changes are included in this PR?

Size() and not(size()) are implemented for all list fields with a required element type. Attempting to filter on a list of optional elements will throw an exception in the schema validator. This is because the existing record-level filtering setup (IncrementallyUpdatedFilterPredicateEvaluator) only feeds non-null values to the ValueInspectors. Thus, if you had an array [1, 2, null, 4], it would only count 3 elements. I can file a ticket to support this eventually, but I think we'd have to rework the FilteringRecordMaterializer to be aware of repetition/definition levels.

The list group itself can be optional or required. Null lists are treated as having size 0. Again, this is due to the difficulty of distinguishing null lists from empty lists at the record-level filtering step. (Would love feedback on both of these design decisions!)
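To make the two design decisions above concrete, here is a minimal plain-Java sketch of the intended semantics. It does not use the Parquet API: `SizeSemantics`, `Op`, `matches`, and `nonNullCount` are hypothetical names chosen for illustration, and the lists are ordinary `java.util.List` values rather than Parquet records.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: not part of parquet-java.
final class SizeSemantics {
  enum Op { EQ, LT, LTE, GT, GTE }

  // A null list is counted as size 0, as described in the PR text.
  static int effectiveSize(List<Integer> list) {
    return list == null ? 0 : list.size();
  }

  static boolean matches(List<Integer> list, Op op, int n) {
    int size = effectiveSize(list);
    switch (op) {
      case EQ: return size == n;
      case LT: return size < n;
      case LTE: return size <= n;
      case GT: return size > n;
      case GTE: return size >= n;
      default: throw new IllegalArgumentException("unknown operator " + op);
    }
  }

  // What a counter would report if it were only ever fed non-null values,
  // which is why lists of optional elements are rejected up front.
  static int nonNullCount(List<Integer> list) {
    if (list == null) {
      return 0;
    }
    int count = 0;
    for (Integer value : list) {
      if (value != null) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    List<Integer> withNull = Arrays.asList(1, 2, null, 4);
    System.out.println("null list matches size == 0: " + matches(null, Op.EQ, 0));
    System.out.println("true size: " + withNull.size()
        + ", size seen by a non-null counter: " + nonNullCount(withNull));
  }
}
```

The mismatch between `withNull.size()` and `nonNullCount(withNull)` is the undercount that the schema validator guards against.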

Are these changes tested?

Unit tests + tested a snapshot build locally with real datasets

Are there any user-facing changes?

New Operators API

Part of #1452

clairemcginty · 2024-12-05T20:05:37Z

parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java

+    // If all values have repetition level 0, then no array has more than 1 element
+    if (repetitionLevelHistogram.size() == 1
+        || repetitionLevelHistogram.subList(1, repetitionLevelHistogram.size()).stream()
+            .allMatch(l -> l == 0)) {
+
+      // Null list fields are treated as having size 0
+      if (( // all lists are nulls
+          definitionLevelHistogram.subList(1, definitionLevelHistogram.size()).stream()
+              .allMatch(l -> l == 0))
+          || // all lists are size 0
+          (definitionLevelHistogram.get(0) == 0
+              && definitionLevelHistogram.subList(2, definitionLevelHistogram.size()).stream()
+                  .allMatch(l -> l == 0))) {
+
+        final boolean blockCannotMatch =
+            size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
+        return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
+      }
+
+      long maxDefinitionLevel = definitionLevelHistogram.get(definitionLevelHistogram.size() - 1);
+
+      // If all repetition levels are zero and all definitions level are > MAX_DEFINITION_LEVEL - 1, all lists
+      // are of size 1
+      if (definitionLevelHistogram.stream().allMatch(l -> l > maxDefinitionLevel - 1)) {
+        final boolean blockCannotMatch = size.filter(
+            (eq) -> eq != 1, (lt) -> lt <= 1, (lte) -> lte < 1, (gt) -> gt >= 1, (gte) -> gte > 1);
+
+        return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
+      }
+    }
+    long nonNullElementCount =
+        repetitionLevelHistogram.stream().mapToLong(l -> l).sum() - definitionLevelHistogram.get(0);
+    long numNonNullRecords = repetitionLevelHistogram.get(0) - definitionLevelHistogram.get(0);
+
+    // Given the total number of elements and non-null fields, we can compute the max size of any array field
+    long maxArrayElementCount = 1 + (nonNullElementCount - numNonNullRecords);
+    final boolean blockCannotMatch = size.filter(
+        (eq) -> eq > maxArrayElementCount,
+        (lt) -> false,
+        (lte) -> false,
+        (gt) -> gt >= maxArrayElementCount,
+        (gte) -> gte > maxArrayElementCount);
+
+    return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;


Hopefully this is a faithful transcription of the logic outlined here: #1452 (comment)
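For readers following the arithmetic, the sketch below restates only the final fallback branch of the diff above (the `maxArrayElementCount` bound) as standalone Java. The class and method names are hypothetical, and the histograms in `main` are hand-written under the assumption of an optional list of required elements (maximum repetition level 1, maximum definition level 2). The bound rests on one observation: every non-null record contributes at least one entry to the level stream, so the largest list can hold at most the remaining entries plus one.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: mirrors the arithmetic at the end of the StatisticsFilter diff.
final class SizeBoundSketch {

  // Upper bound on the size of any single list in the column chunk.
  static long maxListSize(List<Long> repHist, List<Long> defHist) {
    long totalEntries = 0;
    for (long count : repHist) {
      totalEntries += count;
    }
    long nullLists = defHist.get(0);                 // definition level 0 = null list
    long nonNullEntries = totalEntries - nullLists;
    long nonNullRecords = repHist.get(0) - nullLists; // repetition level 0 starts a record
    return 1 + (nonNullEntries - nonNullRecords);
  }

  // size(col, EQ, n) cannot match when n exceeds the bound.
  static boolean canDropEq(long n, List<Long> repHist, List<Long> defHist) {
    return n > maxListSize(repHist, defHist);
  }

  // size(col, GT, n) cannot match when n is at or above the bound.
  static boolean canDropGt(long n, List<Long> repHist, List<Long> defHist) {
    return n >= maxListSize(repHist, defHist);
  }

  public static void main(String[] args) {
    // Records: [1, 2, 3], null, [4]
    List<Long> repHist = Arrays.asList(3L, 2L);     // 3 record starts, 2 continuations
    List<Long> defHist = Arrays.asList(1L, 0L, 4L); // 1 null list, 0 empty lists, 4 elements
    System.out.println("max list size bound: " + maxListSize(repHist, defHist));
    System.out.println("can drop size == 4: " + canDropEq(4, repHist, defHist));
    System.out.println("can drop size == 3: " + canDropEq(3, repHist, defHist));
  }
}
```

For the example records the bound is 3, so a `size == 4` predicate can skip the block while `size == 3` cannot. The bound is conservative: it is exact here only because the remaining non-null list happens to have a single element.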

clairemcginty · 2024-12-05T20:07:52Z

...et-hadoop/src/test/java/org/apache/parquet/filter2/statisticslevel/TestStatisticsFilter.java

+    assertFalse(canDrop(size(nestedListColumn, Operators.Size.Operator.EQ, 0), columnMeta));
+  }
+
+  private static SizeStatistics createSizeStatisticsForRepeatedField(


I'm dynamically generating SizeStatistics for each test case, which does add a lot of LOC to the file. I could also just replace it with the precomputed SizeStatistics for each test case if that's simpler. I just wrote it this way originally because I wasn't that confident in my ability to translate the striping algorithm by hand for all these cases 😅
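As a rough illustration of what translating the striping algorithm by hand involves, the sketch below derives the two level histograms for the simplest case only: an optional list group containing required int elements (assumed maximum repetition level 1 and maximum definition level 2). It is not the test helper from this PR, `LevelHistogramSketch` and `histograms` are hypothetical names, and nested lists would need more levels than this handles.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: single-level optional list of required elements.
final class LevelHistogramSketch {

  // Returns { repetitionLevelHistogram, definitionLevelHistogram }.
  static long[][] histograms(List<List<Integer>> records) {
    long[] rep = new long[2]; // levels 0..1
    long[] def = new long[3]; // levels 0..2
    for (List<Integer> list : records) {
      rep[0]++; // every record starts with one entry at repetition level 0
      if (list == null) {
        def[0]++; // null list: nothing defined
      } else if (list.isEmpty()) {
        def[1]++; // list defined, but no element
      } else {
        rep[1] += list.size() - 1; // later elements repeat at level 1
        def[2] += list.size();     // each element is fully defined
      }
    }
    return new long[][] {rep, def};
  }

  public static void main(String[] args) {
    List<List<Integer>> records = Arrays.asList(
        Arrays.asList(1, 2, 3),
        null,
        Collections.<Integer>emptyList(),
        Arrays.asList(4));
    long[][] h = histograms(records);
    System.out.println("repetition levels: " + Arrays.toString(h[0]));
    System.out.println("definition levels: " + Arrays.toString(h[1]));
  }
}
```

For these four records the repetition histogram is [4, 2] and the definition histogram is [1, 1, 4].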

clairemcginty · 2024-12-05T20:09:59Z

...rc/main/java/org/apache/parquet/filter2/recordlevel/IncrementallyUpdatedFilterPredicate.java

+    public CountingValueInspector(ValueInspector delegate, Function<Long, Boolean> shouldUpdateDelegate) {
+      this.observedValueCount = 0;
+      this.delegate = delegate;
+      this.shouldUpdateDelegate = shouldUpdateDelegate;


Note: the "shouldUpdateDelegate" function is needed since we don't want to terminate prematurely with a false positive. For example, if we're filtering on size(eq(3)) but the input array has 4 elements, we want to prevent the delegated Eq from returning true after it hits the third element because it thinks the condition is satisfied.
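The false positive described in this note can be reproduced without any Parquet classes. The sketch below contrasts an eager check that answers as soon as the running count equals the target with a deferred check that only decides once the whole record has been seen. Both methods and the class name are hypothetical and are not the `CountingValueInspector` from the diff.

```java
// Illustrative only: shows why a size(eq(n)) check must wait for the end of the record.
final class CountingSketch {

  // Eager: reports a match the moment the running count hits the target.
  static boolean eagerSizeEq(int[] elements, int target) {
    int seen = 0;
    for (int ignored : elements) {
      seen++;
      if (seen == target) {
        return true; // premature: more elements may still follow
      }
    }
    return seen == target;
  }

  // Deferred: counts every element, then compares once at the end of the record.
  static boolean deferredSizeEq(int[] elements, int target) {
    int seen = 0;
    for (int ignored : elements) {
      seen++;
    }
    return seen == target;
  }

  public static void main(String[] args) {
    int[] fourElements = {10, 20, 30, 40};
    System.out.println("eager size == 3: " + eagerSizeEq(fourElements, 3));
    System.out.println("deferred size == 3: " + deferredSizeEq(fourElements, 3));
  }
}
```

On a four-element array the eager version wrongly reports a match for `size == 3`, while the deferred version does not.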

clairemcginty · 2024-12-05T20:10:52Z

...-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java

@@ -378,6 +379,11 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> conta
          indices -> IndexIterator.all(getPageCount()));
    }

+    @Override
+    public PrimitiveIterator.OfInt visit(Size size) {
+      return IndexIterator.all(getPageCount());


repetitionLevelHistogram and definitionLevelHistogram are both in scope here. Should I repeat the logic from StatisticsFilter, or would that be completely redundant?

clairemcginty · 2024-12-06T19:38:26Z

parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java

+      final boolean blockCannotMatch = size.filter(
+          (eq) -> eq < numDistinctValues,
+          (lt) -> lt <= numDistinctValues,
+          (lte) -> lte < numDistinctValues,
+          (gt) -> false,
+          (gte) -> false);
+


Actually, now that I think about it, this isn't accurate, since we don't know the distribution of values. I guess we could combine it with SizeStatistics to get the number of elements and work out the minimum size from there.
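A small counter-example shows the problem. The sketch below copies only the `eq` rule from the diff above (`eq < numDistinctValues` means the block cannot match) and applies it to three single-element lists with three distinct values. The class and method names are hypothetical, and plain Java collections stand in for the dictionary page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: the distinct-value count says nothing about per-list sizes.
final class DictionarySizeCounterExample {

  static int numDistinctValues(List<List<Integer>> lists) {
    Set<Integer> distinct = new HashSet<>();
    for (List<Integer> list : lists) {
      distinct.addAll(list);
    }
    return distinct.size();
  }

  // The eq rule as written in the diff: drop the block when eq < numDistinctValues.
  static boolean eqRuleSaysCannotMatch(int eq, int numDistinctValues) {
    return eq < numDistinctValues;
  }

  static boolean anyListHasSize(List<List<Integer>> lists, int n) {
    for (List<Integer> list : lists) {
      if (list.size() == n) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<List<Integer>> lists =
        Arrays.asList(Arrays.asList(1), Arrays.asList(2), Arrays.asList(3));
    int distinct = numDistinctValues(lists);
    System.out.println("distinct values: " + distinct);
    System.out.println("rule drops block for size == 1: " + eqRuleSaysCannotMatch(1, distinct));
    System.out.println("a list of size 1 exists: " + anyListHasSize(lists, 1));
  }
}
```

Here the rule would drop a block in which every list satisfies `size == 1`, which is the inaccuracy noted above.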

wgtmac · 2024-12-13T15:14:23Z

Thanks for adding this! This is a large PR that I need to take some time to review.

It would be good if @emkornfield @gszadovszky could take a look to see if this is a good use case for SizeStatistics.

clairemcginty · 2024-12-16T18:30:47Z

> Thanks for adding this! This is a large PR that I need to take some time to review.

thanks, no rush on reviewing it! 👍

PARQUET-34: implement Size() filter for repeated columns

dc9cb19

clairemcginty added 3 commits December 5, 2024 15:14

PARQUET-34: Fix FilterApi signature

a0c6815

PARQUET-34: Test multiple size() predicates on different columns

914e5b2

PARQUET-34: Add ignore test for optional array field filter

9241ce2

clairemcginty marked this pull request as ready for review December 6, 2024 18:06

PARQUET-34: Fix DictionaryFilter logic

9f4b270
