
Add option for embedding generation upon BioImage Ingestion #482

Open
wants to merge 11 commits into main
Conversation

@ktsitsi (Collaborator) commented Nov 9, 2023

This PR:

  • Introduces additional functionality in the ingestion process for biomedical images. Both the client-side ingestion function and the UDF have been extended with extra arguments that enable the creation of image embeddings during ingestion.
  • The newly added ingestion arguments are the following:
    :param embedding_model: The model used to create the embeddings. Supported values are of type class EMBEDDINGS.
    :param embedding_level: The resolution level used for the embedding. This may differ from the ingestion level
    selected with the `level` parameter.
    :param embedding_grid: A tuple of (num_of_rows, num_of_cols) describing the grid of patches into which the image
    is split for embedding creation. Internally, the image is split to fit this grid.
  • Currently RESNET is the only supported embedding model, but the introduced type abstraction allows this functionality to be extended in the future.
  • A new type, SupportedExtensions, has been added to enumerate the file extensions supported by the ingestion function.
  • This additional functionality changes the data model specification we have set, so stored image groups are bumped to fmt_version = 3.

The resulting image is shown below:
[Screenshot 2023-11-09 at 3 45 44 PM]

Building a local Docker image of the UDFs, I was able to test it; the results are shown below as well:
[Screenshot 2023-11-09 at 3 45 33 PM]

This PR should be followed by PRs that do the following; they will be linked here upon creation:

  • Update the python-udf-imaging Dockerfile in TileDB-REST-UDF-DOCKER-IMAGES with the required dependencies, namely tiledb-vector-search and tensorflow
  • Update the tiledb-cloud version in the UDFs to match the release containing this PR, so that the new functions and modules can be found on the UDF side


This pull request has been linked to Shortcut Story #33097: Add option for embedding generation upon ingest.

Comment on lines +2 to +5
from .helpers import batch
from .helpers import get_embeddings_uris
from .helpers import scale_calc
from .helpers import serialize_filter
Contributor

these helpers don’t seem like they would be public-facing.


def get_embeddings_uris(output_file_uri: str) -> Tuple[str, str]:
destination = os.path.dirname(output_file_uri)
filename = os.path.basename(output_file_uri).split(".")
Contributor

check out os.path.splitext
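A sketch of the suggested change: `os.path.splitext` splits a filename on its last dot and is safer than `basename(...).split(".")` for names containing multiple dots. The `_embeddings_flat`/`_embeddings_ivf` URI suffixes below are illustrative assumptions, not the PR's actual layout.

```python
import os
from typing import Tuple


def get_embeddings_uris(output_file_uri: str) -> Tuple[str, str]:
    """Derive embedding-index URIs next to the output file.

    Illustrative sketch: the suffix scheme is assumed; the point is
    using os.path.splitext instead of basename(...).split(".").
    """
    destination = os.path.dirname(output_file_uri)
    # splitext only splits on the *last* dot:
    # "image.v2.tiff" -> ("image.v2", ".tiff")
    filename, _ext = os.path.splitext(os.path.basename(output_file_uri))
    flat_uri = os.path.join(destination, f"{filename}_embeddings_flat")
    ivf_uri = os.path.join(destination, f"{filename}_embeddings_ivf")
    return flat_uri, ivf_uri


print(get_embeddings_uris("s3://bucket/data/image.v2.tiff"))
```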

ktsitsi added a commit that referenced this pull request Nov 16, 2023
ktsitsi added a commit that referenced this pull request Nov 24, 2023
* Separating supported extensions from #482
filtered = split_in_patches(image, embedding_level, embedding_grid)

patches_array = np.array([])
for patch in filtered:
Member

(also seems slow..)

region_width = level_shape_w // grid_col_num
# Loop through the image and extract each region
patches = []
for i in range(grid_row_num):
Member

This seems like it is going to be very slow?
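On the performance concern: for a 2-D image, the grid split can be done without nested Python loops by reshaping, which also avoids growing `patches_array` incrementally. A minimal sketch, assuming the image is first cropped to a multiple of the grid (the `split_in_patches` name and grid arguments mirror the snippet above; the channel-free 2-D case is an assumption):

```python
import numpy as np


def split_in_patches(image: np.ndarray, grid: tuple) -> np.ndarray:
    """Split a 2-D image into a (rows, cols) grid of patches.

    Returns an array of shape (rows * cols, patch_h, patch_w), as a
    vectorized alternative to looping over range(grid_row_num).
    """
    grid_row_num, grid_col_num = grid
    region_h = image.shape[0] // grid_row_num
    region_w = image.shape[1] // grid_col_num
    # Crop any remainder so the image tiles the grid exactly
    cropped = image[: region_h * grid_row_num, : region_w * grid_col_num]
    return (
        cropped.reshape(grid_row_num, region_h, grid_col_num, region_w)
        .swapaxes(1, 2)
        .reshape(-1, region_h, region_w)
    )


img = np.arange(36).reshape(6, 6)
print(split_in_patches(img, (2, 3)).shape)  # (6, 3, 2)
```

Since `reshape`/`swapaxes` return views where possible, the final stack requires a single copy rather than one allocation per patch.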

np.array(embeddings.shape, dtype="uint32").tofile(f)
np.array(embeddings).astype("float32").tofile(f)

vs.ingest(
Contributor

Can you explain a bit the use case that you want to support?

From what I understand this is creating one vector index per image, storing the embedding of different patches of the image. Does this mean that you want to find similar patches within the image? Do you want to do any cross image queries?

Collaborator (Author)

The idea here is to find similar patches across multiple images. We don't have a specific customer request for any of this. The reasoning behind the patches is that in this data environment it is quite unlikely to find similarities across whole images: each image as a whole is quite different from the others, depending on the depicted cell. Also, to my knowledge, the blank space outside the cells captured by the sensor can introduce a lot of noise into the model if the image is fed to it whole. So my thought was that a future user would want to search across images for similarities to a specific region of the query image, or, as a next step, to query a region from the viewer and find similar "abnormalities" (which they can interpret and I can't) across the images.
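For reference, the `tofile` serialization shown earlier writes the array shape as a uint32 header followed by the values as float32. A minimal round-trip sketch of that layout (the helper names and file path are illustrative, not from the PR):

```python
import tempfile

import numpy as np


def write_features(embeddings: np.ndarray, path: str) -> None:
    """Write embeddings in the header+data layout used above:
    the array shape as uint32, then the values as float32."""
    with open(path, "wb") as f:
        np.array(embeddings.shape, dtype="uint32").tofile(f)
        np.array(embeddings).astype("float32").tofile(f)


def read_features(path: str, ndim: int = 2) -> np.ndarray:
    """Read the file back: the first `ndim` uint32 values are the shape."""
    with open(path, "rb") as f:
        shape = np.fromfile(f, dtype="uint32", count=ndim)
        data = np.fromfile(f, dtype="float32")
    return data.reshape(tuple(shape))


emb = np.random.rand(6, 2048).astype("float32")  # e.g. one vector per patch
with tempfile.NamedTemporaryFile(suffix=".bin") as tmp:
    write_features(emb, tmp.name)
    restored = read_features(tmp.name)
print(np.allclose(emb, restored))  # True
```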


vs.ingest(
index_type="FLAT",
index_uri=embeddings_flat_uri,
Contributor

You also need to pass the source_uri of the input vectors (tmp_features_file).
