refactor: Remove spawn and channel inside arrow reader #806

Xuanwo · 2024-12-16T07:51:16Z

This PR removes the spawn and channel inside the arrow reader so users can read the data stream concurrently without extra cost.

Signed-off-by: Xuanwo <[email protected]>

Xuanwo · 2024-12-16T07:52:06Z

Also cc @sdd for a look, thank you!

sdd · 2024-12-20T08:56:54Z

I'm not sure this is what we want. try_buffer_unordered executes futures concurrently according to the docs: https://docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_buffer_unordered

Concurrent does not necessarily mean parallel, and parallel is what we want. I encountered a similar thing when we were originally using try_for_each_concurrent here. As per the note in the second paragraph of the docs here, no threads are introduced: the executor switches between the futures on a single thread. The reader is CPU-bound rather than IO-bound like the planning phase, so we want true parallelism.

The double-spawn approach is not as neat but it ensures parallelism.
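
The difference between concurrent and parallel execution can be sketched with the standard library alone. This is a minimal illustration, not the iceberg-rust reader: `NoopWaker`, `YieldOnce`, `decode_file`, `poll_all_on_this_thread`, and `run_on_spawned_threads` are hypothetical names, and the hand-rolled polling loop only approximates what a buffering combinator does inside a single task.

```rust
use std::collections::HashSet;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, ThreadId};

/// Waker that does nothing; enough for a busy-polling demo loop.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Future that returns `Pending` once, then `Ready`, to force interleaving.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Hypothetical stand-in for decoding one data file.
/// It reports the thread it finished on.
async fn decode_file(_id: u32) -> ThreadId {
    YieldOnce(false).await; // give the other futures a turn
    thread::current().id()
}

/// Concurrent: every future is polled in turn by the calling thread.
fn poll_all_on_this_thread(n: u32) -> HashSet<ThreadId> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut pending: Vec<_> = (0..n).map(|i| Box::pin(decode_file(i))).collect();
    let mut seen = HashSet::new();
    while !pending.is_empty() {
        // Keep futures that are still pending; record the thread of finished ones.
        pending.retain_mut(|fut| match fut.as_mut().poll(&mut cx) {
            Poll::Ready(id) => {
                seen.insert(id);
                false
            }
            Poll::Pending => true,
        });
    }
    seen
}

/// Parallel: each unit of work gets its own OS thread.
fn run_on_spawned_threads(n: u32) -> HashSet<ThreadId> {
    let handles: Vec<_> = (0..n)
        .map(|_| thread::spawn(|| thread::current().id()))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

Calling `poll_all_on_this_thread(4)` yields a set with a single thread ID, while `run_on_spawned_threads(4)` yields four distinct IDs. For CPU-bound work only the second form can use more than one core at a time.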

Xuanwo · 2024-12-20T09:08:57Z

> I'm not sure this is what we want. try_buffer_unordered executes futures concurrently according to the docs: docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_buffer_unordered

Hi, try_buffer_unordered does execute concurrently: multiple futures are polled at the same time by multiple async runtime worker threads.

My ultimate goal is to eliminate concurrency_limit_data_files from our reading process and handle only FileScanTasks to create a Stream<Item=RecordBatch>. This approach will allow users to utilize our APIs to implement various reading strategies.

Our current implementation prevents the query engine from adopting Push-Based Execution, which should have control over the underlying I/O and CPU operations.

See also apache/arrow-rs#6907
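
As a rough analogy for the goal described above, the sketch below has the reader hand back lazy units of work and leaves the execution strategy to the caller. It uses only the standard library and plain closures rather than a `Stream<Item=RecordBatch>`, and every name in it (`Batch`, `ReadJob`, `plan_reads`, `read_sequential`, `read_parallel`) is hypothetical, not part of the iceberg-rust API.

```rust
use std::thread;

/// Hypothetical stand-in for a decoded record batch.
#[derive(Debug, PartialEq)]
struct Batch {
    file_id: u32,
    rows: usize,
}

/// One lazy unit of read work; nothing runs until the caller invokes it.
type ReadJob = Box<dyn FnOnce() -> Batch + Send + 'static>;

/// Hypothetical reader: maps each scan task to a job and does not spawn
/// threads or apply any concurrency limit of its own.
fn plan_reads(file_ids: &[u32]) -> Vec<ReadJob> {
    file_ids
        .iter()
        .map(|&file_id| {
            Box::new(move || Batch {
                file_id,
                rows: file_id as usize * 100, // placeholder for real decoding
            }) as ReadJob
        })
        .collect()
}

/// Strategy A: the caller runs every job on its own thread, in order.
fn read_sequential(jobs: Vec<ReadJob>) -> Vec<Batch> {
    jobs.into_iter().map(|job| job()).collect()
}

/// Strategy B: the caller fans out to one OS thread per job.
fn read_parallel(jobs: Vec<ReadJob>) -> Vec<Batch> {
    let handles: Vec<_> = jobs
        .into_iter()
        .map(|job| thread::spawn(job))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("read job panicked"))
        .collect()
}
```

Both strategies consume the same `plan_reads` output, so a query engine could pick whichever scheduling policy suits it without the reader imposing one.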

sdd · 2024-12-20T19:03:29Z

> Hi, try_buffer_unordered does execute concurrently: multiple futures are polled at the same time by multiple async runtime worker threads.

In that case, great - try_buffer_unordered is much neater.

refactor: Remove spawn and channel inside arrow reader

6cef43a

Signed-off-by: Xuanwo <[email protected]>

Xuanwo requested review from Fokko and liurenjie1024 December 16, 2024 07:51

Xuanwo mentioned this pull request Dec 20, 2024

Table Scan Delete File Handling: Positional and Equality Delete Support #652

Open

sdd approved these changes Dec 20, 2024
