-
It can do both. The tutorials show the direct sampling approach. The on-disk chipping is available via the
Happy to answer any questions. There are a few other similar libraries as well. You might also want to check out this comparison: https://torchgeo.readthedocs.io/en/latest/user/alternatives.html.
-
## GeoWATCH Overview

This is an overview of the GeoWATCH framework and its relation to RasterVision as of 2024-01-16.

In 2020 Kitware won a contract to compete in the IARPA SMART program. The SMART program challenged 5 teams to design algorithms that use satellite imagery from multiple sensors (e.g., Sentinel-2, Landsat-8, WorldView, PlanetScope) to search over multiple large areas of space and time (e.g., 1000 square kilometers over 8 years) for events of interest. The driving use case was heavy construction (e.g., construction covering more than 8000 square meters). The program also evaluated performers on other tasks, such as detecting transient events (e.g., Burning Man, renaissance fairs), and encouraged teams to look for solutions that were generalizable.

The input to the problem is a SpatioTemporal Asset Catalog (STAC) that indexes the images available for use. The goal is to ingest data from a region of interest and predict geojson "site summaries": polygons with a start and end date that should contain the "event of interest". Additionally, each "site" is sub-classified into phases of the event (i.e., for heavy construction the phases are site preparation and active construction).

Kitware, together with team members from the University of Connecticut, Washington University, Rutgers, and DZYNE Technologies, developed GeoWATCH as its solution. The initial pitch was that each subteam would develop a semantically meaningful raster feature, and Kitware would develop a "fusion" module that combines these features together, producing a heatmap for every (selected) input image in the sequence. The heatmaps are trained to be bright when an event occurs and dark otherwise. We extract polygons from these heatmaps using a "tracker" module to produce the final geojson output.

## KWCOCO Data Interchange

To facilitate data interchange between the teams in a machine-learning friendly way, we expanded the kwcoco module with additional features to support large multispectral satellite images. The kwcoco module itself uses the MS-COCO format as a starting point. We augmented each "image" dictionary to be able to reference one or more "assets", each of which corresponds to a file on disk. Each asset contains a channel code as well as a transform (usually affine) that warps the pixels in the asset into an aligned "image view" or "image space". Similarly, multiple images can be registered to a "video", and another transform can be specified that aligns the images in a "video view" or "video space". More details on warping and spaces (which we are going to rename views) can be found in the KWCOCO Spaces document.

The KWCOCO spec isn't specifically for geospatial data, but it allows user-defined keys, so in the geowatch tooling we define kwcoco extensions that populate each image with fields like "sensor_coarse" and use geotiff metadata to infer the transforms. To create the initial base KWCOCO for a region, we use a pipeline that pulls the STAC catalog that points to processed large image tiles, creates a virtual uncropped kwcoco dataset that points to those tiles (using the GDAL virtual filesystem), and then crops out (orthorectifying if necessary) a set of images that are aligned up to an affine transform. The idea is to delay as much resampling as possible until the very last minute so we can leverage fused affine transforms.
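To make the asset/transform structure concrete, here is a rough sketch of what a single extended kwcoco image entry might look like. The key names (e.g., `warp_aux_to_img`, `warp_img_to_vid`) reflect my reading of the kwcoco documentation and may differ slightly between versions, and all values are made up for illustration.

```python
# Sketch of one extended kwcoco "image" entry (illustrative values only).
image_entry = {
    'id': 1,
    'name': 'crop_2018-01-01_S2_region1',
    'video_id': 1,                      # this image belongs to a video (image sequence)
    'sensor_coarse': 'S2',              # geowatch extension: coarse sensor label
    'date_captured': '2018-01-01',
    'warp_img_to_vid': {'scale': 0.5},  # aligns this image into the shared "video space"
    'assets': [
        {
            # each asset is a file on disk with its own channels and resolution
            'file_name': 'S2/B04_B03_B02.tif',
            'channels': 'red|green|blue',
            'warp_aux_to_img': {'scale': 1.0},  # affine warp from asset space to image space
        },
        {
            'file_name': 'S2/B8A.tif',
            'channels': 'nir',
            'warp_aux_to_img': {'scale': 2.0},  # e.g., a 20m band aligned onto the 10m image grid
        },
    ],
}
```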
In production, datasets are constructed on the fly, but for development we gather kwcoco datasets for each region and push them to members of the development team using DVC.

## Semantically Rich Features

Because the KWCOCO format allows an image to reference multiple assets, we define an API such that the input to each module is a KWCOCO file and the output is a new KWCOCO file. In this way we can compute a kwcoco where each image points to both its original sensed assets and the semantically rich features. Some of the features our team members contributed include rasters for landcover, depth, COLD, materials, SAM features, MAE features, and time-invariant features. For example, if we wanted to enrich a kwcoco file with an additional feature (e.g., features from SegmentAnything), we would run a command like:

```bash
python -m geowatch.tasks.sam.predict \
    --input_kwcoco "/path/to/input.kwcoco.zip" \
    --output_kwcoco "/path/to/output.kwcoco.zip" \
    --weights_fpath "/path/to/models/sam/sam_vit_h_4b8939.pth"
```

## Training Pipeline

Given a kwcoco file with (or without) enriched features, we can now train a "fusion" model. Our training backend is implemented with a customized LightningCLI. The main components are the KWCocoDataModule, the KWCocoDataset, and the MultimodalTransformer.
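The components above follow the standard LightningCLI pattern. As a generic sketch of that wiring (assuming lightning >= 2.0; this is not GeoWATCH's actual entry point, and the classes below are stand-ins for the real KWCocoDataModule and MultimodalTransformer):

```python
# Generic sketch of a LightningCLI-style entry point (not the real geowatch CLI).
import lightning.pytorch as pl
from lightning.pytorch.cli import LightningCLI


class DemoDataModule(pl.LightningDataModule):
    """Stand-in for a KWCocoDataModule-like datamodule."""
    def __init__(self, batch_size: int = 2):
        super().__init__()
        self.batch_size = batch_size


class DemoModel(pl.LightningModule):
    """Stand-in for a MultimodalTransformer-like model."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.hidden_dim = hidden_dim


if __name__ == '__main__':
    # LightningCLI exposes the constructor arguments of the model and the
    # datamodule as CLI / YAML configuration, which is the pattern the
    # training backend builds on.
    LightningCLI(DemoModel, DemoDataModule)
```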
A typical configuration will use a

To sample data efficiently, we abstract data sampling via the ndsampler API, which uses delayed-image under the hood to translate the "virtual coordinate system" visible to the developer into the correct locations in the larger images. As long as the images are COGs this is reasonably fast. For annotation sampling, ndsampler builds an rtree to look up annotations that overlap a target region.

When training a model, we check if the hash of the kwcoco file has been seen before. If not, we compute and cache dataset statistics like the mean/std for each modality and the class frequencies. Based on the configuration of the KWCocoDataset, the sample grid can be spatially regular or centered on specific annotations. Similarly, the "time kernel" controls the time steps selected for each spatial location. This defines a large flat list of grid "targets", which we organize in a tree in order to balance the sampling of data (e.g., positive / negative regions).

Our Lightning trainer is fairly standard with a few extensions. First, we have sophisticated batch item visualization that provides the developer with real-time feedback on qualitative performance. Below is an example of this visualization for a training run that used the following sensor/channel configuration:
This image gives a summary of much of the information in the batch, including: truth heatmaps, truth bounding boxes, per-task pixelwise weights, and selected bands from the underlying imagery. Also notable in the above data is that some of the images have checkerboard patterns. These represent NODATA regions, which are maintained as NaNs in the tensors all the way up to the network forward pass. At that point we subtract the mean, divide by the std, and then zero the NaNs, which means that NaN values are always imputed as the mean of the dataset.

In the above sensorchan spec, the pipe-separated channels are early fused: for each frame, all of these channels are stacked into a single tensor that is passed through a sensor-specific ConvNet to normalize the number of channels (we literally maintain a dictionary that maps a sensorchan code to a specific stem). Then we tokenize these channel-normalized features, add positional encodings, stack them, and send them through the transformer. At the end we pool activations from timesteps that have multiple sensors and pass them to task-specific heads, which produce heatmaps aligned to the inputs (although in the future we plan on adding a decoder to ask for predictions at unobserved times). Given the outputs, the network computes the loss and then Lightning does its thing. A rough illustration of the network looks like this:

Additional interesting training capabilities include a partial implementation of loss-of-plasticity. We also have the ability to initialize a network from another one that is similar, but may have different numbers of layers / heads / stems, using partial weight loading, which maps weights from one network to another by finding a maximal subtree isomorphism. This has been critical for continuing to train our networks over a long time while changing the feature configurations. We have observed that after models are improved by training on semantically rich features, we can drop those features and retrain a new network that retains some of the old performance. In other words, the heavyweight features seem to be "instilled" into the network.

## Prediction Pipeline

After a model is trained, we use torch.package to build a model bundle that contains its training configuration, model code, and weights. The idea is that we should be able to pass this model to our prediction script and have all train-time configurations (e.g., batch sampling) inferred by the predict script as defaults. The predict script itself runs the model over a sliding window and stitches the heatmaps back into a larger raster, as illustrated:

## Software Testing

GeoWATCH places a much larger emphasis on testing than the average research repository. To enable testing we've developed "kwcoco toydata", which can produce demo kwcoco datasets for object detection / tracking / segmentation / classification problems. It can generate dummy MSI imagery and has several knobs that can be configured. A sample RGB visualization looks like this:

For GeoWATCH itself, we sometimes need geo-referenced data and not just image data, and for this geowatch extends the kwcoco demodata to add these additional fields. Additionally, many other data structures defined in geowatch and other supporting libraries come equipped with a

While there are some unit tests, most of the testing is done via doctests and run with xdoctest.
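For anyone who wants to poke at that demo data, here is a minimal sketch of generating it from Python. The demo key (`'vidshapes8-multispectral'`) is from my recollection of the kwcoco docs and may need adjusting for your version.

```python
# Sketch: build a small demo kwcoco dataset like the "kwcoco toydata" tool does.
# The demo key is an assumption and may differ between kwcoco versions.
import kwcoco

dset = kwcoco.CocoDataset.demo('vidshapes8-multispectral')
print(f'{len(dset.index.videos)} videos, '
      f'{len(dset.index.imgs)} images, '
      f'{len(dset.index.anns)} annotations')

# Each image is a plain dictionary; multispectral demo images carry their
# bands as auxiliary/asset entries with channel codes.
img = dset.index.imgs[1]
for asset in img.get('auxiliary', img.get('assets', [])):
    print(asset['channels'], asset['file_name'])
```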
## MLOps

To evaluate our systems over a parameter grid, we've written an mlops system to define prediction pipelines and run them over a grid of parameters (using a GitHub-Actions-like YAML configuration). The basic pipeline structure has the user define the paths a process is expected to take as inputs and produce as outputs. Outputs of one process can be connected as inputs of another without the user needing to manually specify them; only unconnected inputs and non-default configuration variables must be given. The user specifies the relative name for each output file, but the mlops system chooses the directory the outputs will be written to. It does this using a hashed directory structure, which lets it determine whether a process has completed and ensures that a change in the pipeline configuration only causes the affected results to be recomputed. To make navigation of this directory structure easier, each node's output folder is equipped with symlinks to the predecessor nodes it depends on as well as the successor nodes that depend on it.

The system assumes that all processes are invokable as a bash script (i.e., there is a CLI for each operation a user might want), which is a key design decision. This allows the mlops system to only be concerned with generating the right bash invocations to run a pipeline. In each output node we write an "invoke.sh" script which provides the bash invocation used to compute the node's results. This has been instrumental when debugging. The bash-script assumption also means that we can abstract how a pipeline or DAG is run. We do this via the cmd_queue module. To use this module the user creates a queue and then submits jobs as bash command strings, along with references to the jobs they depend on. The actual execution of the jobs is abstracted by one of three (perhaps soon to be four) backends:
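To make the cmd_queue pattern concrete, here is a minimal sketch based on my reading of the cmd_queue documentation; the exact argument names (e.g. `backend='serial'`, `depends=...`) are assumptions and may differ by version.

```python
# Minimal sketch of the cmd_queue usage pattern described above.
# Backend name and keyword arguments are assumptions about the API.
import cmd_queue

queue = cmd_queue.Queue.create(backend='serial')

# Each submission is just a bash command string.
job1 = queue.submit('echo "stage 1: predict heatmaps"')
job2 = queue.submit('echo "stage 2: extract polygons"', depends=[job1])

# The queue takes care of running the jobs in dependency order.
queue.run()
```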
## Relationship to RasterVision

Our dataloader automatically computes the mean/std of the input dataset as well as class frequencies. This seems similar to the "ANALYZE" step in RasterVision. Something GeoWATCH does not yet do is allow the user to specify the mean/std or frequency statistics so training is not forced to compute them.

Our virtual sample grid seems to correspond to "CHIP" in the RasterVision pipeline. RasterVision's direct sampling seems to correspond to what we can do with ndsampler. We are going to run some tests to further compare them and see if one is faster than the other. GeoWATCH doesn't have the ability to pre-chip data, but if you can afford the preprocessing it will likely be faster than sampling directly from COGs, although it does limit the translation augmentation the dataloader can do.

For the "TRAIN" step it seems like both frameworks settled on Lightning, so porting our callbacks for use in RasterVision shouldn't be too hard. Something that is nice about how geowatch invokes LightningCLI is that it can specify the entire config inline in bash. Our tutorial 1 shows an example of this; it requires a small hack to make it work. RasterVision uses pydantic for configuration, whereas we use scriptconfig (a less popular but more flexible tool). This also requires some monkeypatches on top of jsonargparse, but my hope is that I can upstream some of those changes so both pydantic- and scriptconfig-based configs can be used.

For PREDICT, it seems both frameworks have similar strategies of incrementally stitching together heatmap predictions from batches. For vector outputs such as bounding boxes, the main GeoWATCH fusion tool doesn't produce them yet, but this is in development and will work similarly to our implementation of a DINO box predictor, where detections are accumulated and non-max suppressed. Note that our implementation of non-max suppression and other efficient annotation data structures are provided by a standalone library, kwimage. Something we've strived for in building these tools is to modularize them into separate Python modules with fewer dependencies, so it is easier to re-use or re-purpose them in other libraries.

For EVAL, we have object detection and pixelwise segmentation metrics, as well as the official metrics code provided to us by IARPA. Currently the object detection metrics live in kwcoco, and the plan is to port the pixelwise segmentation metrics there as well. A good deal of work has gone into making them efficient, so it will be interesting to compare implementations.

For BUNDLE, it looks like both frameworks again have similar solutions. I'm glad others have realized how important this is. We use torch.package to bundle the code and the weights. One tweak we needed to make is to include a package header so the predict script knows the name of the module that is packaged.
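To illustrate the general torch.package bundling idea (including the "header" trick mentioned above), here is a rough sketch; the module/resource names and header contents are hypothetical, not GeoWATCH's actual layout.

```python
# Rough sketch of the torch.package bundling pattern described above.
# Resource names and the header layout are hypothetical.
from torch import nn
from torch.package import PackageExporter, PackageImporter

model = nn.Linear(4, 2)  # stand-in for a trained fusion model

# Export: bundle code + weights, plus a small header so the loader
# knows which package/pickle to pull out of the archive.
with PackageExporter('model_bundle.pt') as exp:
    exp.extern(['torch', 'torch.**'])  # rely on the consumer's torch install
    exp.save_pickle('package_header', 'header.pkl',
                    {'module_name': 'my_model', 'arch': 'linear-demo'})
    exp.save_pickle('my_model', 'model.pkl', model)

# Import: read the header first, then load the model it points to.
imp = PackageImporter('model_bundle.pt')
header = imp.load_pickle('package_header', 'header.pkl')
reloaded = imp.load_pickle(header['module_name'], 'model.pkl')
print(header, reloaded)
```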
-
Hi, since this is something that had come up in our discussion, I wanted to give an update: the new v0.30 release no longer requires exact versions of dependencies. (They're still there in the
-
I'm attempting to understand RasterVision, and I'm curious how it handles the chipping of images. Does it write the smaller images to disk, or does it leverage COGs to subsample from larger images?
I've been working on a similar open source library (geowatch), and I'm wondering how much common ground we've covered.
EDIT: I've changed the name of this thread from "RasterVision Internals - Is chipping done on disk?" to "RasterVision Comparison To GeoWATCH".