Add option for embedding generation upon BioImage Ingestion #482
This pull request has been linked to Shortcut Story #33097: Add option for embedding generation upon ingest.
from .helpers import batch
from .helpers import get_embeddings_uris
from .helpers import scale_calc
from .helpers import serialize_filter
these helpers don’t seem like they would be public-facing.
def get_embeddings_uris(output_file_uri: str) -> Tuple[str, str]:
    destination = os.path.dirname(output_file_uri)
    filename = os.path.basename(output_file_uri).split(".")
check out os.path.splitext
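A minimal sketch of what the suggestion could look like. The helper name `split_output_name` and its return shape are hypothetical, since the rest of `get_embeddings_uris` is not shown in the diff. It only illustrates that `os.path.splitext` splits on the last dot, so a file name containing several dots keeps its stem intact.

```python
import os
from typing import Tuple


def split_output_name(output_file_uri: str) -> Tuple[str, str, str]:
    """Return (directory, stem, extension) for an output URI.

    Hypothetical helper: the real get_embeddings_uris may build its
    two URIs differently from these parts.
    """
    destination = os.path.dirname(output_file_uri)
    # splitext splits only on the last dot, unlike str.split(".")
    stem, ext = os.path.splitext(os.path.basename(output_file_uri))
    return destination, stem, ext
```

With `str.split(".")`, a name such as `sample.ome.tdb` yields three parts that must be rejoined by hand; `splitext` returns the stem and the final extension directly.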
…n-for-embedding-generation-upon-ingest
filtered = split_in_patches(image, embedding_level, embedding_grid)

patches_array = np.array([])
for patch in filtered:
(also seems slow..)
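One possible way to address this, assuming the loop grows `patches_array` one patch at a time (the loop body is not visible in the diff): collect the patches and stack them once. The function name `stack_patches` is hypothetical.

```python
import numpy as np


def stack_patches(patches):
    """Stack equally shaped patches into one array with a single copy.

    Assumption: every patch has the same shape, e.g. (H, W, C).
    """
    patch_list = [np.asarray(p) for p in patches]
    if not patch_list:
        # No patches: return an empty array, like np.array([])
        return np.empty((0,))
    # One allocation for the whole batch instead of one per iteration
    return np.stack(patch_list, axis=0)
```

Repeatedly appending to a NumPy array reallocates and copies the whole array on every iteration, whereas a single `np.stack` copies each patch once.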
region_width = level_shape_w // grid_col_num
# Loop through the image and extract each region
patches = []
for i in range(grid_row_num):
This seems like it is going to be very slow?
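A hedged sketch of a loop-free alternative, assuming the image is already in memory as a NumPy array of shape `(height, width)` or `(height, width, channels)` and that leftover rows and columns at the edges can be dropped. The function name `split_in_grid` is hypothetical; the variable names follow the diff.

```python
import numpy as np


def split_in_grid(image, grid_row_num, grid_col_num):
    """Split an image into grid_row_num * grid_col_num equal regions."""
    level_shape_h, level_shape_w = image.shape[:2]
    region_height = level_shape_h // grid_row_num
    region_width = level_shape_w // grid_col_num
    # Drop edge pixels that do not fill a whole region
    cropped = image[: region_height * grid_row_num, : region_width * grid_col_num]
    channels = cropped.shape[2:]
    # Split both spatial axes into (grid index, offset inside region)
    tiles = cropped.reshape(
        grid_row_num, region_height, grid_col_num, region_width, *channels
    )
    # Bring the two grid axes together: (rows, cols, h, w, ...)
    tiles = np.swapaxes(tiles, 1, 2)
    # Flatten the grid in row-major order
    return tiles.reshape(
        grid_row_num * grid_col_num, region_height, region_width, *channels
    )
```

The regions come out in row-major order, so region `i * grid_col_num + j` corresponds to grid row `i` and grid column `j`, matching what a nested loop over rows and then columns would produce.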
np.array(embeddings.shape, dtype="uint32").tofile(f)
np.array(embeddings).astype("float32").tofile(f)

vs.ingest(
Can you explain a bit the use case that you want to support?
From what I understand this is creating one vector index per image, storing the embedding of different patches of the image. Does this mean that you want to find similar patches within the image? Do you want to do any cross image queries?
The idea here is to find similar patches across multiple images. We don't have any specific customer request for any of that. The idea behind the patches, though, is that in this data environment it is quite unlikely to find similarities across whole images. Each image as a whole is quite different from the others, depending on the depicted cell. Also, as far as I know, the blank space outside the cells captured by the sensor can introduce a lot of noise into the model if the image is fed to it as a whole. So my thought was that a future user would want to search across images for regions similar to a specific region of the query image, or, as a next step, to query a region from the viewer and find similar "abnormalities" (which they can justify and I can't) across the images.
vs.ingest(
    index_type="FLAT",
    index_uri=embeddings_flat_uri,
You also need to pass source_uri of the input vectors (tmp_features_file).
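For reference, a small sketch of the file layout that the two `tofile` calls in the diff produce: a header holding the array shape as `uint32` values, followed by the vectors as `float32`. The helper names `write_vectors` and `read_vectors` are hypothetical, and the sketch assumes a 2-D embeddings array; the path written here is what would be passed as `source_uri`.

```python
import os
import tempfile

import numpy as np


def write_vectors(path, embeddings):
    """Write a (num_vectors, dim) array as a uint32 shape header plus float32 data."""
    embeddings = np.asarray(embeddings, dtype="float32")
    with open(path, "wb") as f:
        np.array(embeddings.shape, dtype="uint32").tofile(f)
        embeddings.tofile(f)


def read_vectors(path):
    """Read back a file written by write_vectors (2-D arrays only)."""
    with open(path, "rb") as f:
        rows, dim = np.fromfile(f, dtype="uint32", count=2)
        data = np.fromfile(f, dtype="float32", count=int(rows) * int(dim))
    return data.reshape(int(rows), int(dim))
```

Reading the file back is a quick way to check the temporary features file before handing it to the ingestion call.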
This PR:
types abstraction we are able to enhance this functionality in the future.
SupportedExtensions has been added concerning the file extensions we support with our ingestion function.
fmt_version = 3
The resulting image is shown below:
Building a local docker image of the UDFs I was able to test it and the results are shown below as well:
This PR should be followed by PRs that will do the following; they will be linked here upon creation:
python-udf-imaging dockerfile in TileDB-REST-UDF-DOCKER-IMAGES with the dependencies needed, such as tiledb-vector-search and tensorflow