Added options to fine-tune settings for bulk operations #43509

Open
wants to merge 3 commits into
base: main

FabianMeiswinkel
Member

@FabianMeiswinkel FabianMeiswinkel commented Dec 20, 2024

Description

This PR allows fine-tuning some settings for bulk ingestion:

  • Spark: Config spark.cosmos.write.bulk.minTargetBatchSize can be used to override the minimum target batch size. The target batch size is calculated based on the throttling rate and by default can be reduced to 1; this setting allows increasing the minimum target batch size.
  • Core SDK: The following system properties/environment variables can be used to fine-tune the default settings for bulk ingestion
    • System property COSMOS.MIN_TARGET_BULK_MICRO_BATCH_SIZE/Environment variable COSMOS_MIN_TARGET_BULK_MICRO_BATCH_SIZE
      • Can be used to change the default minimum batch size. The target batch size is calculated dynamically - but will not be lower than this value. The default value is 1 - it should only be increased if the service-side provisioned throughput cannot be saturated with default settings (for example because the client machine hits bottlenecks in CPU-usage).
    • System property COSMOS.MAX_BULK_MICRO_BATCH_CONCURRENCY/Environment variable COSMOS_MAX_BULK_MICRO_BATCH_CONCURRENCY
      • Can be used to change the default concurrency - how many micro-batches can be sent concurrently to a single physical partition. The default value is 1 - it should only be increased if the service-side provisioned throughput cannot be saturated - for example because the network round-trip latency between the client and service is too high.
    • System property COSMOS.MAX_BULK_MICRO_BATCH_FLUSH_INTERVAL_IN_MILLISECONDS/Environment variable COSMOS_MAX_BULK_MICRO_BATCH_FLUSH_INTERVAL_IN_MILLISECONDS
      • Can be used to change the default interval at which operations are sent to the service even when the micro-batch size has not reached its limit yet. The default value is 1000 ms - it should only be changed when there is a need to commit changes sooner than after 1 second - for example for real-time processing.
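
As an illustration of the settings above, here is a minimal Java sketch. It is not the SDK's implementation: the class `BulkTuningSketch` and its helpers `resolve` and `sparkWriteOptions` are hypothetical names introduced only for this example. The property, environment variable, and Spark config names are taken from the description above. The lookup order shown (system property first, then environment variable, then default) is an assumption, as is the idea that the values should be set before the Cosmos client is created.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of azure-cosmos.
final class BulkTuningSketch {
    // Names as listed in the PR description.
    static final String MIN_BATCH_SIZE_PROP = "COSMOS.MIN_TARGET_BULK_MICRO_BATCH_SIZE";
    static final String MAX_CONCURRENCY_PROP = "COSMOS.MAX_BULK_MICRO_BATCH_CONCURRENCY";
    static final String FLUSH_INTERVAL_PROP =
        "COSMOS.MAX_BULK_MICRO_BATCH_FLUSH_INTERVAL_IN_MILLISECONDS";
    static final String SPARK_MIN_BATCH_SIZE_KEY = "spark.cosmos.write.bulk.minTargetBatchSize";

    // Resolves a setting. ASSUMPTION: system property wins over the
    // environment variable; the SDK's real precedence may differ.
    static int resolve(String propertyName, int defaultValue) {
        // The environment variable name is the property name with '.' replaced by '_'.
        String envName = propertyName.replace('.', '_');
        String raw = System.getProperty(propertyName);
        if (raw == null || raw.isEmpty()) {
            raw = System.getenv(envName);
        }
        if (raw == null || raw.isEmpty()) {
            return defaultValue;
        }
        return Integer.parseInt(raw.trim());
    }

    // Builds the extra Spark write option from the PR description; merge it
    // into the options already passed to the Cosmos Spark connector.
    static Map<String, String> sparkWriteOptions(int minTargetBatchSize) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put(SPARK_MIN_BATCH_SIZE_KEY, Integer.toString(minTargetBatchSize));
        return options;
    }

    public static void main(String[] args) {
        // Set overrides early, before any Cosmos client is created (assumption:
        // the SDK reads them at startup). Passing -D flags on the JVM command
        // line is the more conservative alternative.
        System.setProperty(MIN_BATCH_SIZE_PROP, "10");
        System.setProperty(MAX_CONCURRENCY_PROP, "2");
        System.setProperty(FLUSH_INTERVAL_PROP, "250");

        // Defaults (1, 1, 1000 ms) are the ones stated in the PR description.
        System.out.println("min batch size: " + resolve(MIN_BATCH_SIZE_PROP, 1));
        System.out.println("max concurrency: " + resolve(MAX_CONCURRENCY_PROP, 1));
        System.out.println("flush interval ms: " + resolve(FLUSH_INTERVAL_PROP, 1000));
        System.out.println("spark options: " + sparkWriteOptions(10));
    }
}
```

The same overrides can be supplied without code changes, for example with `-DCOSMOS.MIN_TARGET_BULK_MICRO_BATCH_SIZE=10` on the JVM command line or by exporting `COSMOS_MIN_TARGET_BULK_MICRO_BATCH_SIZE=10`. As the description notes, the defaults should only be changed when the provisioned throughput cannot otherwise be saturated.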

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot reviewed 5 out of 7 changed files in this pull request and generated no comments.

Files not reviewed (2)
  • sdk/cosmos/azure-cosmos-spark_3_2-12/src/main/scala/com/azure/cosmos/spark/CosmosConfig.scala: Language not supported
  • sdk/cosmos/azure-cosmos-spark_3_2-12/src/test/scala/com/azure/cosmos/spark/SparkE2EWriteITest.scala: Language not supported
Comments suppressed due to low confidence (1)

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/batch/PartitionScopeThresholdsTest.java:44

  • Corrected the method name from 'alwaysThrottledShouldResultInBatSizeOfOne' to 'alwaysThrottledShouldResultInBatchSizeOfOne'.
public void alwaysThrottledShouldResultInBatchSizeOfOne() {
@azure-sdk
Collaborator

API change check

API changes are not detected in this pull request.

@FabianMeiswinkel FabianMeiswinkel changed the title Added option to override configuration for minTargetBulkBatchSize Added options to fine-tune settings for bulk operations Dec 20, 2024
Member

@jeet1995 jeet1995 left a comment

LGTM - except 1 question.

@FabianMeiswinkel
Member Author

/azp run java - cosmos - spark

Azure Pipelines successfully started running 1 pipeline(s).


3 participants