Refactor the inner data abstraction #552

DinoBektesevic · 2024-04-08T16:34:44Z

DinoBektesevic
Apr 8, 2024
Maintainer

Right now, to write what is perceived as a simple thing, we sometimes have to jump through an inordinate number of hoops.

For example:

images = [img.get_science().image for img in work_unit.image_stack.get_images()]

because internally (in the core search, in C++) we wrap every array in a RawImage, stack those into LayeredImages, and then collect those into ImageStacks, which go to StackSearch to get run. Anything we add in Python, for example the work unit, is then one wrapper above all of this.

Then we have to peel the onion: from work unit to image stack, to layered image, to raw image, to the underlying array. Much of the C++ functionality can be simplified since the transition to Eigen, and we only really need one core data abstraction. That leaves a namespace-like one (like StackSearch) that doesn't need to be available from Python but just collects some functionality that helps manage and moderate GPU communication and execution, and one that holds the data.

My original preference was to just have one large ImageStack that has lists of arrays for science, mask, and variance images, PSFs, WCSs and timestamps. The rest of the functionality can be implemented in terms of collections of Eigen/numpy arrays and lists/arrays of supporting data (e.g. timestamps), which are preferably Eigen row vectors in C++ so they map to numpy row vectors in Python and let us do things like list and boolean indexing and vectorized operations.
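A minimal sketch of what such a single container could look like on the Python side. The `FlatImageStack` name, its fields and the `select` helper are hypothetical and not part of the current code base; PSFs and WCSs are left out to keep it short.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FlatImageStack:
    """Hypothetical single container: one array per layer, first axis is the image index."""

    science: np.ndarray   # shape (n_images, height, width)
    mask: np.ndarray      # same shape as science
    variance: np.ndarray  # same shape as science
    obstimes: np.ndarray  # shape (n_images,)

    def select(self, keep):
        """Return a new stack restricted to the images picked by a boolean or integer index."""
        return FlatImageStack(
            self.science[keep], self.mask[keep], self.variance[keep], self.obstimes[keep]
        )


rng = np.random.default_rng(0)
stack = FlatImageStack(
    science=rng.normal(size=(4, 3, 3)),
    mask=np.zeros((4, 3, 3), dtype=int),
    variance=np.ones((4, 3, 3)),
    obstimes=np.array([60000.0, 60001.0, 60002.5, 60004.0]),
)

# Boolean indexing on the timestamps selects whole images at once.
late = stack.select(stack.obstimes > 60001.5)

# Vectorized operation on the supporting data: zero the times on the first image.
zeroed_times = stack.obstimes - stack.obstimes[0]
```

With this layout the list comprehension from the example above collapses to `stack.science`.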

Then Steven brought up a good point about how he wanted to just directly set Psi/Phi, so now I'm thinking perhaps the derivation of those values should all be in Python and the core search is just those three lists: obstime, psi and phi. Then the WorkUnit can just build the two arrays Python-side.
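A rough sketch of building the two arrays Python-side with plain numpy. It assumes the usual matched-filter form, psi as science/variance convolved with the PSF and phi as 1/variance convolved with the squared PSF, and it ignores masking entirely. `convolve_same` and `compute_psi_phi` are hypothetical helpers, not existing functions.

```python
import numpy as np


def convolve_same(image, kernel):
    """Zero-padded 2D convolution that keeps the image shape (odd-sized kernel assumed)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]  # flip the kernel so this is a convolution, not a correlation
    height, width = image.shape
    out = np.zeros((height, width), dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += flipped[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return out


def compute_psi_phi(science, variance, psf):
    """Assumed definitions: psi = (sci / var) * psf, phi = (1 / var) * psf**2."""
    psi = convolve_same(science / variance, psf)
    phi = convolve_same(1.0 / variance, psf**2)
    return psi, phi


science = np.full((5, 5), 2.0)
variance = np.full((5, 5), 4.0)
delta_psf = np.zeros((3, 3))
delta_psf[1, 1] = 1.0  # a delta-function PSF leaves the images unchanged

psi, phi = compute_psi_phi(science, variance, delta_psf)
```

The core search would then only need `obstimes`, `psi` and `phi`, one psi/phi pair per image.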

maxwest-uw · 2024-04-08T19:37:50Z

maxwest-uw
Apr 8, 2024
Collaborator

Personally, I'm totally down to get rid of the RawImage abstraction, as I think it's kind of useless and makes things messier (plus we have to put a lot of data into them that isn't really an "Image"...). But I actually think that the LayeredImage abstraction is generally pretty useful and makes the code more readable. We could make the access pattern a bit better by just having the various required arrays accessed directly through LayeredImage, like

images = [img.science for img in work_unit.image_stack.get_images()]

I was specifically thinking about this with regard to issues with reprojection and data provenance (which we should discuss at this week's meeting), as we'll probably need a couple more elements in LayeredImage (possibly another array to store approximate original pixel locations after reprojection, as well as the footprint of multiple images when we combine images of the same obstime).

0 replies

DinoBektesevic · 2024-04-15T16:05:27Z

DinoBektesevic
Apr 15, 2024
Maintainer Author

Ok, so I agree that having access to an abstraction that peers into the arrays on the WU is good. To me that sounds like a dataclass:

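from dataclasses import dataclass
import numpy as np

@dataclass
class LayeredImage:
    science: np.ndarray  # np.ndarray is the array type; np.array is the constructor function
    mask: np.ndarray
    variance: np.ndarray
    psf: np.ndarray
    obstime: float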

or a struct on C++ side:

// Eigen::Matrix needs template arguments; MatrixXf (float pixels) is assumed here.
struct LayeredImage {
  Eigen::MatrixXf science;
  Eigen::MatrixXf mask;
  Eigen::MatrixXf variance;
  Eigen::MatrixXf psf;
  double obstime;
};

Any masking, convolution, etc. functions then just operate on an array, with overloads for lists of arrays and for a layered image. All this other stuff goes:

// Basic getter functions for image data.
unsigned get_width() const { return width; }
unsigned get_height() const { return height; }
unsigned get_npixels() const { return width * height; }
double get_obstime() const { return science.get_obstime(); }
void set_obstime(double obstime) { science.set_obstime(obstime); }

// Getter functions for the data in the individual layers.
RawImage& get_science() { return science; }
RawImage& get_mask() { return mask; }
RawImage& get_variance() { return variance; }

// Getter functions for the pixels of the science and variance layers that check
// the mask layer for any set bits.
inline float get_science_pixel(const Index& idx) const;
inline float get_variance_pixel(const Index& idx) const;
inline bool contains(const Index& idx) const;

// Masking functions.
void mask_pixel(const Index& idx);
void binarize_mask(int flags_to_keep);
void union_masks(RawImage& new_mask);
void union_threshold_masking(float thresh);
void grow_mask(int steps);
void apply_mask(int flags);

// Subtracts a template image from the science layer.
void subtract_template(RawImage& sub_template);

// Setter functions for the individual layers.
void set_science(RawImage& im);
void set_mask(RawImage& im);
void set_variance(RawImage& im);

and then all of this can go:
struct Rectangle
inline Rectangle anchored_block
inline std::tuple<int, int, int> centered_range(int val, const int r, const int width)

and then none of these indexing tricks and abstractions we use on the C++ side ever have a reason to float up to Python (which they now do, so all those extensive bindings can also go away). Since the transition to Eigen we can implement convolution, masking, growing, etc. in a way that makes them functions usable directly in Python with numpy arrays, without copies. So whatever masking and convolution code we have in C++ is now available Python-side too. Sweet.

As it is now, we have to punt it all to C++ object this and that to do an operation, then unpack the attribute we want to get the result of the operation, so we can then work with it further as a numpy array, because in Python we don't have an effective tool to use these classes the way the C++ compiler enables us to. The impracticality of this is my primary annoyance.

I guess I could see that it's easier to implement WU in terms of a list of layered images than three independent collections of arrays, so I could take that bit back. But we could also just implement a getter for LayeredImage on WU, plus an iterator, and make iteration over WU return very cheap LayeredImage dataclasses. I don't think element access actually requires any of the bulk we have just to support this class.
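The "one function on a plain array, with variants for lists of arrays and for a layered image" idea can be approximated in Python with `functools.singledispatch`. This is only an illustration: the `apply_mask` signature, the trimmed-down `LayeredImage` and the choice of NaN for masked pixels are all assumptions, not the existing API.

```python
from dataclasses import dataclass
from functools import singledispatch

import numpy as np


@dataclass
class LayeredImage:
    """Trimmed-down, hypothetical version of the dataclass discussed above."""

    science: np.ndarray
    mask: np.ndarray
    variance: np.ndarray


@singledispatch
def apply_mask(data, flags, mask=None):
    raise TypeError(f"apply_mask does not support {type(data).__name__}")


@apply_mask.register(np.ndarray)
def _apply_mask_array(data, flags, mask=None):
    # Core implementation: masked pixels become NaN in a float copy of the array.
    out = np.array(data, dtype=float)
    out[(mask & flags) != 0] = np.nan
    return out


@apply_mask.register(list)
def _apply_mask_list(data, flags, mask=None):
    # Lists of arrays reuse the array implementation element by element.
    return [apply_mask(d, flags, m) for d, m in zip(data, mask)]


@apply_mask.register(LayeredImage)
def _apply_mask_layered(data, flags, mask=None):
    # A layered image carries its own mask, so no extra argument is needed.
    return LayeredImage(
        science=apply_mask(data.science, flags, data.mask),
        mask=data.mask,
        variance=apply_mask(data.variance, flags, data.mask),
    )


sci = np.array([[1.0, 2.0], [3.0, 4.0]])
msk = np.array([[0, 1], [2, 0]])
img = LayeredImage(science=sci, mask=msk, variance=np.ones((2, 2)))

masked_array = apply_mask(sci, 1, msk)
masked_list = apply_mask([sci, sci], 1, [msk, msk])
masked_image = apply_mask(img, 1)
```

All three calls end up in the same array routine, which is the piece that would live in C++ and take numpy arrays directly.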
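A sketch of that getter-plus-iterator idea, assuming the work unit stores each layer as one stacked array. The `WorkUnit` class here is a hypothetical stand-in, not the real one; integer indexing of a numpy array returns a view, so each `LayeredImage` is cheap to build.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class LayeredImage:
    science: np.ndarray
    mask: np.ndarray
    variance: np.ndarray
    obstime: float


class WorkUnit:
    """Hypothetical stand-in that keeps one stacked array per layer."""

    def __init__(self, science, mask, variance, obstimes):
        self.science = science    # shape (n_images, height, width)
        self.mask = mask
        self.variance = variance
        self.obstimes = obstimes  # shape (n_images,)

    def __len__(self):
        return len(self.obstimes)

    def __getitem__(self, i):
        # Integer indexing yields views, so no pixel data is copied here.
        return LayeredImage(
            self.science[i], self.mask[i], self.variance[i], float(self.obstimes[i])
        )

    def __iter__(self):
        return (self[i] for i in range(len(self)))


wu = WorkUnit(
    science=np.arange(18, dtype=float).reshape(2, 3, 3),
    mask=np.zeros((2, 3, 3), dtype=int),
    variance=np.ones((2, 3, 3)),
    obstimes=np.array([60000.0, 60001.0]),
)

images = [img.science for img in wu]
second = wu[1]
```

Because the dataclass only holds views, writing through `second.science` would modify the work unit's array, which may or may not be desirable.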

1 reply

maxwest-uw Apr 15, 2024
Collaborator

I like dataclasses, so I'm all for that approach, and the C++ version also seems much nicer.
