
Add option for embedding generation upon BioImage Ingestion #482

Open
wants to merge 11 commits into main
Conversation

@ktsitsi (Collaborator) commented Nov 9, 2023

This PR:

  • Introduces additional functionality in the ingestion process for biomedical images. Both the client-side ingestion function and the UDF have been extended with extra arguments that enable the creation of image embeddings during ingestion.
  • The newly added ingestion arguments are the following:
    :param embedding_model: The model used to create the embeddings. Supported values are of type class EMBEDDINGS.
    :param embedding_level: The resolution level used for the embedding. This may differ from the ingestion level
    selected with the `level` parameter.
    :param embedding_grid: A tuple of (num_of_rows, num_of_cols) describing the grid of patches into which the image
    is split for embedding creation. Internally, the image is split to fit this grid.
  • Currently RESNET is the only supported embedding model, but the introduced type abstraction allows this functionality to be extended in the future.
  • A new type, SupportedExtensions, has been added to enumerate the file extensions supported by the ingestion function.
  • This additional functionality changes the data model specification we have set, so stored image groups are bumped to fmt_version = 3.

The resulting image is shown below:
[Screenshot 2023-11-09 at 3 45 44 PM]

Building a local Docker image of the UDFs, I was able to test it; the results are shown below as well:
[Screenshot 2023-11-09 at 3 45 33 PM]

This PR should be followed by PRs that do the following; they will be linked here upon creation:

  • Update the python-udf-imaging Dockerfile in TileDB-REST-UDF-DOCKER-IMAGES with the required dependencies, namely tiledb-vector-search and tensorflow
  • Update the tiledb-cloud version in the UDFs to match the release containing this PR, so that the new functions and modules can be found on the UDF side


This pull request has been linked to Shortcut Story #33097: Add option for embedding generation upon ingest.

Comment on lines +2 to +5
from .helpers import batch
from .helpers import get_embeddings_uris
from .helpers import scale_calc
from .helpers import serialize_filter
Contributor

these helpers don’t seem like they would be public-facing.


def get_embeddings_uris(output_file_uri: str) -> Tuple[str, str]:
destination = os.path.dirname(output_file_uri)
filename = os.path.basename(output_file_uri).split(".")
Contributor

check out os.path.splitext
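A sketch of the suggested change: `os.path.splitext` splits a filename on its last dot and is safer than `basename(...).split(".")` for names containing multiple dots. The `_embeddings_flat`/`_embeddings_ivf` URI suffixes below are illustrative assumptions, not the PR's actual layout.

```python
import os
from typing import Tuple


def get_embeddings_uris(output_file_uri: str) -> Tuple[str, str]:
    """Derive embedding-index URIs next to the output file.

    Illustrative sketch: the suffix scheme is assumed; the point is
    using os.path.splitext instead of basename(...).split(".").
    """
    destination = os.path.dirname(output_file_uri)
    # splitext only splits on the *last* dot:
    # "image.v2.tiff" -> ("image.v2", ".tiff")
    filename, _ext = os.path.splitext(os.path.basename(output_file_uri))
    flat_uri = os.path.join(destination, f"{filename}_embeddings_flat")
    ivf_uri = os.path.join(destination, f"{filename}_embeddings_ivf")
    return flat_uri, ivf_uri


print(get_embeddings_uris("s3://bucket/data/image.v2.tiff"))
```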

ktsitsi added a commit that referenced this pull request Nov 16, 2023
ktsitsi added a commit that referenced this pull request Nov 24, 2023
* Separating supported extensions from #482
filtered = split_in_patches(image, embedding_level, embedding_grid)

patches_array = np.array([])
for patch in filtered:
Member

(also seems slow..)

region_width = level_shape_w // grid_col_num
# Loop through the image and extract each region
patches = []
for i in range(grid_row_num):
Member

This seems like it is going to be very slow?
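On the performance concern: for a 2-D image, the grid split can be done without nested Python loops by reshaping, which also avoids growing `patches_array` incrementally. A minimal sketch, assuming the image is first cropped to a multiple of the grid (the `split_in_patches` name and grid arguments mirror the snippet above; the channel-free 2-D case is an assumption):

```python
import numpy as np


def split_in_patches(image: np.ndarray, grid: tuple) -> np.ndarray:
    """Split a 2-D image into a (rows, cols) grid of patches.

    Returns an array of shape (rows * cols, patch_h, patch_w), as a
    vectorized alternative to looping over range(grid_row_num).
    """
    grid_row_num, grid_col_num = grid
    region_h = image.shape[0] // grid_row_num
    region_w = image.shape[1] // grid_col_num
    # Crop any remainder so the image tiles the grid exactly
    cropped = image[: region_h * grid_row_num, : region_w * grid_col_num]
    return (
        cropped.reshape(grid_row_num, region_h, grid_col_num, region_w)
        .swapaxes(1, 2)
        .reshape(-1, region_h, region_w)
    )


img = np.arange(36).reshape(6, 6)
print(split_in_patches(img, (2, 3)).shape)  # (6, 3, 2)
```

Since `reshape`/`swapaxes` return views where possible, the final stack requires a single copy rather than one allocation per patch.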

np.array(embeddings.shape, dtype="uint32").tofile(f)
np.array(embeddings).astype("float32").tofile(f)

vs.ingest(
Contributor

Can you explain a bit the use case that you want to support?

From what I understand this is creating one vector index per image, storing the embedding of different patches of the image. Does this mean that you want to find similar patches within the image? Do you want to do any cross image queries?

Collaborator (Author)

The idea here is to find similar patches across multiple images. We don't have a specific customer request for any of this. The reasoning behind the patches is that in this data environment it is quite unlikely to find similarities across whole images: each image as a whole is quite different from the others, depending on the depicted cell. Also, to my knowledge, the blank space outside the cells captured by the sensor can introduce a lot of noise into the model if the image is fed to it whole. So my thought was that a future user would want to search across images for similarities to a specific region of the query image, or, as a next step, to query a region from the viewer and find similar "abnormalities" (which they can interpret and I can't) across the images.
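For reference, the `tofile` serialization shown earlier writes the array shape as a uint32 header followed by the values as float32. A minimal round-trip sketch of that layout (the helper names and file path are illustrative, not from the PR):

```python
import tempfile

import numpy as np


def write_features(embeddings: np.ndarray, path: str) -> None:
    """Write embeddings in the header+data layout used above:
    the array shape as uint32, then the values as float32."""
    with open(path, "wb") as f:
        np.array(embeddings.shape, dtype="uint32").tofile(f)
        np.array(embeddings).astype("float32").tofile(f)


def read_features(path: str, ndim: int = 2) -> np.ndarray:
    """Read the file back: the first `ndim` uint32 values are the shape."""
    with open(path, "rb") as f:
        shape = np.fromfile(f, dtype="uint32", count=ndim)
        data = np.fromfile(f, dtype="float32")
    return data.reshape(tuple(shape))


emb = np.random.rand(6, 2048).astype("float32")  # e.g. one vector per patch
with tempfile.NamedTemporaryFile(suffix=".bin") as tmp:
    write_features(emb, tmp.name)
    restored = read_features(tmp.name)
print(np.allclose(emb, restored))  # True
```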


vs.ingest(
index_type="FLAT",
index_uri=embeddings_flat_uri,
Contributor

You also need to pass the source_uri of the input vectors (tmp_features_file).
