Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: avoid async_trait macro for IcebergWriter and provide extra dyn trait for object safety #760

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

wenym1
Copy link

@wenym1 wenym1 commented Dec 6, 2024

Previously, we use async_trait for trait IcebergWriter and IcebergWriterBuilder. For traits implemented with async_trait, all call to the async methods will generate a BoxedFuture, which may incur unnecessary cost in box allocation.

In this PR, we will avoid using async_trait for the two traits, so that the BoxedFuture can be optionally avoided. To retain the object-safety, we provide with the object-safe counterpart to the two traits, named DynIcebergWriter and DynIcebergWriterBuilder. We do impl IcebergWriter for Box<dyn DynIcebergWriter> and impl IcebergWriterBuilder for Box<dyn DynIcebergWriterBuilder>, so that the type erased dyn trait object can still be used as IcebergWriter and IcebergWriterBuilder. Nevertheless, for the two dyn traits, the futures generated from calling their async methods are still boxed futures.

Note that, after this PR, there can be backward compatibility issue in the public API of the library. When impl the two traits, we may have to remove the previous async_trait that wraps the impl block like what we did in this PR.

@wenym1
Copy link
Author

wenym1 commented Dec 6, 2024

cc @ZENOTME

@ZENOTME
Copy link
Contributor

ZENOTME commented Dec 6, 2024

Thanks, @wenym1! I think PR is great so that we can't avoid some limits to designing the writer API because of object safety. cc @liurenjie1024 @Xuanwo @Fokko

@ZENOTME
Copy link
Contributor

ZENOTME commented Dec 10, 2024

This PR needs to fix the conflict after #741. It change the interface of writer builder

@wenym1 wenym1 requested a review from ZENOTME December 10, 2024 04:43
@wenym1
Copy link
Author

wenym1 commented Dec 10, 2024

@ZENOTME Comments are addressed. PTAL

@wenym1 wenym1 requested a review from ZENOTME December 10, 2024 04:56
@liurenjie1024
Copy link
Contributor

Thanks @wenym1 for this pr, could you elaborate the benefit of this change? As you said, this may introduce breaking api change, why we need to do this? One point you mentioned is the box allocation, do we have measurement of how much this cost is compared with actual IO?

@ZENOTME
Copy link
Contributor

ZENOTME commented Dec 10, 2024

Thanks @wenym1 for this pr, could you elaborate the benefit of this change? As you said, this may introduce breaking api change, why we need to do this? One point you mentioned is the box allocation, do we have measurement of how much this cost is compared with actual IO?

Personally, I think more important benefits of this PR is to provide extra dyn traits for object safety. After separating these two trait, we can design the inner trait without worrying about the object safety. It can avoid some problem like #703 (comment). Also, after this PR, our writer builder be object safe now, it originally isn't. In practice, we found it's useful for this because in some case, user want to store writer builder in some place and wrap it Box make things easier.

@ZENOTME
Copy link
Contributor

ZENOTME commented Dec 12, 2024

I find that the implementation has some problems now, it will cause recursive calls endlessly and stack overflow finally.

Reproduce:

   #[tokio::test]
    async fn test_box_writer() {
        let temp_dir = TempDir::new().unwrap();
        let file_io = FileIOBuilder::new_fs_io().build().unwrap();
        let location_gen =
            MockLocationGenerator::new(temp_dir.path().to_str().unwrap().to_string());
        let file_name_gen =
            DefaultFileNameGenerator::new("test".to_string(), None, DataFileFormat::Parquet);

        let pw = ParquetWriterBuilder::new(
            WriterProperties::builder().build(),
            file_io.clone(),
            location_gen,
            file_name_gen,
        );
        let data_file_builder =
            DataFileWriterBuilder::new(Arc::new(Schema::builder().build().unwrap()), pw, None).boxed();

        // stack overflow here.
        let mut writer = data_file_builder.build().await.unwrap();
    }

@wenym1
Copy link
Author

wenym1 commented Dec 12, 2024

I find that the implementation has some problems now, it will cause recursive calls endlessly and stack overflow finally.

Reproduce:

   #[tokio::test]
    async fn test_box_writer() {
        let temp_dir = TempDir::new().unwrap();
        let file_io = FileIOBuilder::new_fs_io().build().unwrap();
        let location_gen =
            MockLocationGenerator::new(temp_dir.path().to_str().unwrap().to_string());
        let file_name_gen =
            DefaultFileNameGenerator::new("test".to_string(), None, DataFileFormat::Parquet);

        let pw = ParquetWriterBuilder::new(
            WriterProperties::builder().build(),
            file_io.clone(),
            location_gen,
            file_name_gen,
        );
        let data_file_builder =
            DataFileWriterBuilder::new(Arc::new(Schema::builder().build().unwrap()), pw, None).boxed();

        // stack overflow here.
        let mut writer = data_file_builder.build().await.unwrap();
    }

Thanks for pointing it out. The stack overflow was caused by accidentally repeatedly interleaving call on the build of IcebergWriterBuilder and DynIcebergWriterBuilder. Already fixed.

@Xuanwo
Copy link
Member

Xuanwo commented Dec 13, 2024

Hi, thank you @wenym1 for your work on this, and thanks to @ZENOTME and @liurenjie1024 for their reviews.

I'm a bit concerned about the complexity this PR introduces.

One point you mentioned is the box allocation, do we have measurement of how much this cost is compared with actual IO?

Given that users always utilize the dyn-compatible API from outside, I believe the box allocation cannot be avoided.

Personally, I think more important benefits of this PR is to provide extra dyn traits for object safety.

I thought IcebergWriter was already a dyn-compatible trait, but IcebergWriterBuilder is not. Maybe we could work on IcebergWriterBuilder directly? Perhaps we could modify it to take &self instead of self.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants