add metadata blob_storage_total_files and blob_storage_file_index on azure blob storage input #89

Open
wants to merge 1 commit into base: main
Conversation

mrchypark
This PR adds two new metadata fields to the Azure Blob Storage input:

blob_storage_total_files: The total number of files in the Azure Blob Storage container.
blob_storage_file_index: The current file index being processed.
These new metadata fields provide users with additional context about the progress of file processing in their Azure Blob Storage input.

Changes:

Added totalFiles and currentIndex fields to the azureBlobStorage struct.
Modified the Connect method to count the total number of files.
Updated the blobStorageMetaToBatch function to include the new metadata fields.
Incremented the currentIndex after processing each file in the ReadBatch method.
These changes will help users track the progress of their Azure Blob Storage input processing, especially when dealing with large numbers of files. The new metadata can be used for logging, monitoring, or implementing custom logic based on the processing progress.
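As a rough sketch of the bookkeeping described above (this is illustrative, not the PR's actual diff; the `tracker` type and `metaFor` helper are hypothetical names, though `totalFiles`/`currentIndex` and the metadata keys come from the PR description):

```go
package main

import "fmt"

// tracker mimics the totalFiles/currentIndex bookkeeping the PR adds
// to the azureBlobStorage struct (type and method names are illustrative).
type tracker struct {
	totalFiles   int
	currentIndex int
}

// metaFor returns the metadata attached to the next processed file and
// then increments the index, as the PR description outlines:
// blob_storage_total_files stays constant, blob_storage_file_index advances.
func (t *tracker) metaFor() map[string]string {
	meta := map[string]string{
		"blob_storage_total_files": fmt.Sprintf("%d", t.totalFiles),
		"blob_storage_file_index":  fmt.Sprintf("%d", t.currentIndex),
	}
	t.currentIndex++
	return meta
}

func main() {
	t := &tracker{totalFiles: 3}
	for i := 0; i < 3; i++ {
		fmt.Println(t.metaFor())
	}
}
```

With three files, the index runs 0, 1, 2 while the total stays 3, matching the behavior verified in the Testing section below.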

Testing:

Tested the new metadata fields with various file counts in Azure Blob Storage containers.
Verified that the blob_storage_total_files remains constant throughout the processing.
Confirmed that the blob_storage_file_index increments correctly for each processed file.
Please review and let me know if any further changes or clarifications are needed.

}
return a, nil
}

func (a *azureBlobStorage) Connect(ctx context.Context) error {
	var err error
	a.keyReader, err = newAzureTargetReader(ctx, a.log, a.conf)
	if err != nil {
		return err
	}

	// Count total files
	for {
Collaborator:
What if a file is added after the connection is made? Won't this information then become stale?

Author:
I wrote it with only batch mode in mind. After testing, it seems that in batch mode the file list is fetched at the time of connection.

func newAzureTargetReader(ctx context.Context, logger *service.Logger, conf bsiConfig) (azureTargetReader, error) {
	if conf.FileReader == nil {
		return newAzureTargetBatchReader(ctx, conf)
	}
	return &azureTargetStreamReader{
		input: conf.FileReader,
		log:   logger,
	}, nil
}

Looking at this, azureTargetStreamReader is selected when conf.FileReader is set; otherwise the batch reader is used.

Author @mrchypark, Aug 10, 2024:
I still don't know whether the Azure SDK fetches the full list when the input is created, or refreshes it each time the pager runs. Rather than relying on this behavior, it would be better if the pager could indicate when it has reached the end.
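For what it's worth, the Azure SDK for Go's pagers do signal exhaustion: `runtime.Pager[T].More()` returns false once the last page has been fetched. A minimal stand-in sketch (no SDK dependency; the `pager` and `page` types below are local mocks, and the real SDK's `NextPage` also takes a context and returns an error) showing how the total could be computed by draining the pager up front:

```go
package main

import "fmt"

// page is a stand-in for one ListBlobsFlat response page.
type page struct {
	blobNames []string
}

// pager mocks the Azure SDK's runtime.Pager interface shape:
// More() reports whether another page remains, NextPage() returns it.
type pager struct {
	pages []page
	next  int
}

func (p *pager) More() bool { return p.next < len(p.pages) }

func (p *pager) NextPage() page {
	pg := p.pages[p.next]
	p.next++
	return pg
}

// countBlobs drains the pager and returns the total number of blobs,
// mirroring how Connect could compute blob_storage_total_files without
// guessing when the listing ends.
func countBlobs(p *pager) int {
	total := 0
	for p.More() {
		total += len(p.NextPage().blobNames)
	}
	return total
}

func main() {
	p := &pager{pages: []page{
		{blobNames: []string{"a.csv", "b.csv"}},
		{blobNames: []string{"c.csv"}},
	}}
	fmt.Println(countBlobs(p))
}
```

The trade-off the reviewer raised still applies: a count taken at Connect time goes stale if blobs are added later, regardless of how the end of the listing is detected.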
