Running inference and computing test metrics on deepseg_lesion data #22
Following these steps, here are the resulting anima metrics:

Thoughts

Despite not being trained on partial-view images (deepseg_lesion contains either cervical or thoracic subjects) and multiple contrasts, the

@jqmcginnis what are your observations and thoughts on the next steps?
Thank you very much for computing the results on the deepseg_sc/deepseg_lesion test set! I am relieved that the results are going in the right direction. I am also quite intrigued that the model performs better on the t2s images than on the axial images for lesion segmentation; I would not have expected this. While we observed good generalization on the bavaria-quebec test set for the joint sc/lesion segmentation and single-task segmentation models, I am curious whether this is the case for the deepseg data as well. Perhaps we can run some tests here too, to assess whether our assumptions are correct:

(1) Segmentation model - cord only

I will provide you with these models. Moreover, I talked to a colleague of mine (Hendrik), who has been successfully applying nn-unet in different scenarios, and he has made some (code) adjustments, particularly to the data augmentation and the patch-size parameters, when training his models. According to him, choosing the wrong patch size (which may frequently be the case) will definitely impact performance; and since we are predicting on chunks, not holospinal images, this may have a big impact. Thus, once we have some intuition about whether the single-class segmentation models perform worse or better, I would consult him and do some testing with different patch sizes, and perhaps with other modifications he has incorporated. Do we have access to the old training data as well? This might also be an angle worth considering.
Just to clear this up -- the results are just from the test set of
I don't think we can compare it this way -- the number of subjects is not the same; if we had more axial images in the test set, then we might have seen a better Dice score?
This is true, but have you tested it on other (possibly whole-spine or axial) images from your in-house datasets?
This is absolutely true! I am observing meaningful differences in generalization performance just by changing the patch sizes, which confirms that it is indeed a crucial parameter. BUT, the question is: how do you decide on another patch size? It seems that nnUNet uses a good heuristic -- the median size of the images -- which could make sense if you think about it. Any other patch size seems arbitrarily chosen (I certainly did that), which might be hard to justify. In terms of data augmentation, would you know if he is: (1) adding more augmentations or removing them, OR, (2) changing the probabilities of the existing transformations? Because it seems that augmentations that

EDIT:
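To make that heuristic concrete, here is a minimal sketch of computing the per-axis median image size over a set of training images. The `imagesTr` path is a placeholder, and this is only a rough approximation of what nnUNet does internally (its planning also accounts for spacing, resampling and GPU memory), so treat it as a starting point rather than nnUNet's actual logic:

```python
# Rough sketch of the "median image size" heuristic mentioned above.
# Assumptions: nibabel and numpy are installed, and the training images
# live under ./imagesTr (hypothetical path) as .nii.gz files.
import glob
import nibabel as nib
import numpy as np

shapes = np.array(
    [nib.load(path).shape[:3] for path in sorted(glob.glob("imagesTr/*.nii.gz"))]
)
median_shape = np.median(shapes, axis=0).astype(int)
print("per-axis median image size (voxels):", median_shape)
```

A candidate patch size could then be derived from this median (e.g., rounded to values the network's downsampling path supports) rather than picked ad hoc.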
Yes, we have access to the training data for
Good observation, that was a bit too speculative 😅
I have run the nn-unet on a multi-timepoint cohort of 416 people (or even more -- that is just the number of subjects that made it past other selection criteria). However, to be safe, I need to check again whether there is an overlap between the training set and this cohort; if there is, I would be surprised, and it should be very small. Also, we do not have a "hard" GT here -- only manually corrected segmentation masks produced by previous nn-unet models.
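As a side note, a quick way to check for such an overlap -- assuming both datasets follow a BIDS-like layout with `sub-*` folders (the paths below are hypothetical) -- is to intersect the subject IDs:

```python
# Minimal sketch for checking subject overlap between two datasets.
# Assumptions: both directory names are hypothetical and subjects are stored
# as "sub-<id>" folders at the dataset root (BIDS-like convention).
from pathlib import Path

def subject_ids(root: str) -> set:
    # Collect every "sub-*" directory name directly under the dataset root.
    return {p.name for p in Path(root).glob("sub-*") if p.is_dir()}

train_subjects = subject_ids("bavaria-quebec-train")      # hypothetical path
cohort_subjects = subject_ids("multi-timepoint-cohort")   # hypothetical path

overlap = train_subjects & cohort_subjects
print(f"{len(overlap)} overlapping subjects:", sorted(overlap))
```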
I would treat it as a hyperparameter tuned on a hold-out validation set. I think the patch size is often chosen too large.
I think we should try both, i.e. adding other contrasts but also iterating on the current model without the deepseg_lesion data -- at the moment, we can check how well the model generalizes to other datasets when we keep the current split scenario.
Wait, if we are ultimately going to add
I think we can do a combination of both; it's just that I would like to ensure we know exactly what we can attribute the model's success/performance to.
This issue documents the prerequisite steps for testing the `bavaria-quebec` model on the dataset used for the `sct_deepseg_lesion` model. Following the information in spinalcordtoolbox/deepseg_lesion_models#2 (comment), gather the dataset with images, sc-seg labels and lesion-seg labels. The remaining steps are as follows:

Since the `bavaria-quebec` model was trained on RPI images, all images in `deepseg_lesion` have to be converted to RPI as well. Run the command `for file in *.nii.gz; do sct_image -i ${file} -setorient RPI -o ${file}; done` from the root directory.
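The one-liner above overwrites every file in place regardless of its current orientation. Below is a slightly more careful sketch that converts only non-RPI images; it assumes SCT is on the PATH and that `sct_image -getorient` prints the orientation code (the dataset root path is hypothetical), so treat it as a convenience wrapper rather than part of the documented procedure:

```python
# Sketch: convert deepseg_lesion images to RPI only when they are not already RPI.
# Assumptions: the Spinal Cord Toolbox binaries (sct_image) are on the PATH,
# "-getorient" prints the orientation code on its last output line, and the
# dataset root "deepseg_lesion" is a hypothetical path.
import subprocess
from pathlib import Path

root = Path("deepseg_lesion")  # hypothetical dataset root

for nii in sorted(root.rglob("*.nii.gz")):
    result = subprocess.run(
        ["sct_image", "-i", str(nii), "-getorient"],
        capture_output=True, text=True, check=True,
    )
    orient = result.stdout.strip().splitlines()[-1]
    if orient != "RPI":
        print(f"{nii}: {orient} -> RPI")
        subprocess.run(
            ["sct_image", "-i", str(nii), "-setorient", "RPI", "-o", str(nii)],
            check=True,
        )
```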