-
It can do both. The tutorials show the direct sampling approach. The on-disk chipping is available via the
Happy to answer any questions. There are a few other similar libraries as well. You might also want to check out this comparison: https://torchgeo.readthedocs.io/en/latest/user/alternatives.html.
-
## GeoWATCH Overview

This is an overview of the GeoWATCH framework and its relation to RasterVision as of 2024-01-16.

In 2020 Kitware won a contract to compete in the IARPA SMART program. The SMART program challenged 5 teams to design algorithms that use satellite imagery from multiple sensors (e.g., Sentinel-2, Landsat-8, WorldView, PlanetScope) to search over multiple large areas of space and time (e.g., 1000 square kilometers over 8 years) for events of interest. The driving use case was heavy construction (e.g., construction covering more than 8000 square meters). The program also evaluated performers on other tasks, such as detecting transient events (e.g., Burning Man, renaissance fairs), and encouraged teams to look for solutions that were generalizable.

The input to the problem is a SpatioTemporal Asset Catalog (STAC) that indexes the images available for use. The goal is to ingest data from a region of interest and predict geojson "site summaries": polygons with a start and end date that should contain the "event of interest". Additionally, each "site" is sub-classified into phases of the event (i.e., for heavy construction the phases are site preparation and active construction).

Kitware, together with team members from the University of Connecticut, Washington University, Rutgers, and DZYNE Technologies, developed GeoWATCH as its solution. The initial pitch was that each subteam would develop a semantically meaningful raster feature, and Kitware would develop a "fusion" module that combines these features together, producing a heatmap for every (selected) input image in the sequence. The heatmaps are trained to be bright when an event occurs and dark otherwise. We extract polygons from these heatmaps using a "tracker" module to produce the final geojson output.

## KWCOCO Data Interchange

To facilitate data interchange between the teams in a machine-learning friendly way, we expanded the kwcoco module with additional features to support large multispectral satellite images. The kwcoco module itself uses the MS-COCO format as a starting point. We augmented each "image" dictionary to be able to reference one or more "assets", each of which corresponds to a file on disk. Each asset contains a channel code as well as a transform (usually affine) that warps the pixels in the asset into an aligned "image view" or "image space". Similarly, multiple images can be registered to a "video", and another transform can be specified that aligns the images in a "video view" or "video space". More details on warping and spaces (which we are going to rename views) can be found in the KWCOCO Spaces document.

The KWCOCO spec isn't specifically for geospatial data, but it allows user-defined keys, so in the geowatch tooling we define kwcoco extensions that populate each image with fields like "sensor_coarse" and use geotiff metadata to infer the transforms. To create the initial base KWCOCO for a region, we use a pipeline that pulls the STAC catalog that points to processed large image tiles, creates a virtual uncropped kwcoco dataset that points to those tiles (using the GDAL virtual filesystem), and then crops out (orthorectifying if necessary) a set of images that are aligned up to an affine transform. The idea is to delay as much resampling as possible until the very last minute so we can leverage fused affine transforms.
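To make the asset/transform structure concrete, here is a rough sketch of what a single extended kwcoco image entry might look like. The key names (e.g., `warp_aux_to_img`, `warp_img_to_vid`) reflect my reading of the kwcoco documentation and may differ slightly between versions, and all values are made up for illustration.

```python
# Sketch of one extended kwcoco "image" entry (illustrative values only).
image_entry = {
    'id': 1,
    'name': 'crop_2018-01-01_S2_region1',
    'video_id': 1,                      # this image belongs to a video (image sequence)
    'sensor_coarse': 'S2',              # geowatch extension: coarse sensor label
    'date_captured': '2018-01-01',
    'warp_img_to_vid': {'scale': 0.5},  # aligns this image into the shared "video space"
    'assets': [
        {
            # each asset is a file on disk with its own channels and resolution
            'file_name': 'S2/B04_B03_B02.tif',
            'channels': 'red|green|blue',
            'warp_aux_to_img': {'scale': 1.0},  # affine warp from asset space to image space
        },
        {
            'file_name': 'S2/B8A.tif',
            'channels': 'nir',
            'warp_aux_to_img': {'scale': 2.0},  # e.g., a 20m band aligned onto the 10m image grid
        },
    ],
}
```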
In production, datasets are constructed on the fly, but for development we gather kwcoco datasets for each region and push them to members of the development team using DVC.

## Semantically Rich Features

Because the KWCOCO format allows an image to reference multiple assets, we define an API such that the input to each module is a KWCOCO file and the output is a new KWCOCO file. In this way we can compute a kwcoco where each image points to both its original sensed assets and the semantically rich features. Some of the features our team members contributed include rasters for landcover, depth, COLD, materials, SAM features, MAE features, and time-invariant features. For example, if we wanted to enrich a kwcoco file with an additional feature (e.g., features from SegmentAnything), we would run a command like:

```bash
python -m geowatch.tasks.sam.predict \
    --input_kwcoco "/path/to/input.kwcoco.zip" \
    --output_kwcoco "/path/to/output.kwcoco.zip" \
    --weights_fpath "/path/to/models/sam/sam_vit_h_4b8939.pth"
```

## Training Pipeline

Given a kwcoco file with (or without) enriched features, we can now train a "fusion" model. Our training backend is implemented with a customized LightningCLI. The main components are the KWCocoDataModule, the KWCocoDataset, and the MultimodalTransformer.
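The components above follow the standard LightningCLI pattern. As a generic sketch of that wiring (assuming lightning >= 2.0; this is not GeoWATCH's actual entry point, and the classes below are stand-ins for the real KWCocoDataModule and MultimodalTransformer):

```python
# Generic sketch of a LightningCLI-style entry point (not the real geowatch CLI).
import lightning.pytorch as pl
from lightning.pytorch.cli import LightningCLI


class DemoDataModule(pl.LightningDataModule):
    """Stand-in for a KWCocoDataModule-like datamodule."""
    def __init__(self, batch_size: int = 2):
        super().__init__()
        self.batch_size = batch_size


class DemoModel(pl.LightningModule):
    """Stand-in for a MultimodalTransformer-like model."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.hidden_dim = hidden_dim


if __name__ == '__main__':
    # LightningCLI exposes the constructor arguments of the model and the
    # datamodule as CLI / YAML configuration, which is the pattern the
    # training backend builds on.
    LightningCLI(DemoModel, DemoDataModule)
```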
A typical configuration will use a

To sample data efficiently, we abstract data sampling via the ndsampler API, which uses delayed-image under the hood to translate the "virtual coordinate system" visible to the developer into the correct locations in the larger images. As long as the images are COGs this is reasonably fast. For annotation sampling, ndsampler builds an rtree to look up annotations that overlap a target region.

When training a model, we check if the hash of the kwcoco file has been seen before. If not, we compute and cache dataset statistics like the mean/std for each modality and the class frequencies. Based on the configuration of the KWCocoDataset, the sample grid can be spatially regular or centered on specific annotations. Similarly, the "time kernel" controls the time steps selected for each spatial location. This defines a large flat list of grid "targets", which we organize in a tree in order to balance the sampling of data (e.g., positive / negative regions).

Our Lightning trainer is fairly standard with a few extensions. First, we have sophisticated batch item visualization that provides the developer with real-time feedback on qualitative performance. Below is an example of this visualization for a training run that used the following sensor/channel configuration:
This image gives a summary of much of the information in the batch, including: truth heatmaps, truth bounding boxes, per-task pixelwise weights, and selected bands from the underlying imagery. Also notable in the above data is that some of the images have checkerboard patterns. These represent NODATA regions, which are maintained as NaNs in the tensors all the way up to the network forward pass. At that point we subtract the mean, divide by the std, and then zero the NaNs, which means that NaN values are always imputed as the mean of the dataset.

In the above sensorchan spec, the pipe-separated channels are early fused: for each frame, all of these channels are stacked into a single tensor that is passed through a sensor-specific ConvNet to normalize the number of channels (we literally maintain a dictionary that maps a sensorchan code to a specific stem). Then we tokenize these channel-normalized features, add positional encodings, stack them, and send them through the transformer. At the end we pool activations from timesteps that have multiple sensors and pass them to task-specific heads, which produce heatmaps aligned to the inputs (although in the future we plan on adding a decoder to ask for predictions at unobserved times). Given the outputs, the network computes the loss and then Lightning does its thing. A rough illustration of the network looks like this:

Additional interesting training capabilities include a partial implementation of loss-of-plasticity. We also have the ability to initialize a network from another one that is similar, but may have different numbers of layers / heads / stems, using partial weight loading, which maps weights from one network to another by finding a maximal subtree isomorphism. This has been critical for continuing to train our networks over a long time while changing the feature configurations. We have observed that after models are improved by training on semantically rich features, we can drop those features and retrain a new network that retains some of the old performance. In other words, the heavyweight features seem to be "instilled" into the network.

## Prediction Pipeline

After a model is trained, we use torch.package to build a model bundle that contains its training configuration, model code, and weights. The idea is that we should be able to pass this model to our prediction script and have all train-time configurations (e.g., batch sampling) inferred by the predict script as defaults. The predict script itself runs the model over a sliding window and stitches the heatmaps back into a larger raster, as illustrated:

## Software Testing

GeoWATCH places a much larger emphasis on testing than the average research repository. To enable testing we've developed "kwcoco toydata", which can produce demo kwcoco datasets for object detection / tracking / segmentation / classification problems. It can generate dummy MSI imagery and has several knobs that can be configured. A sample RGB visualization looks like this:

For GeoWATCH itself, we sometimes need geo-referenced data and not just image data, and for this geowatch extends the kwcoco demodata to add these additional fields. Additionally, many other data structures defined in geowatch and other supporting libraries come equipped with a

While there are some unit tests, most of the testing is done via doctests and run with xdoctest.
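For anyone who wants to poke at that demo data, here is a minimal sketch of generating it from Python. The demo key (`'vidshapes8-multispectral'`) is from my recollection of the kwcoco docs and may need adjusting for your version.

```python
# Sketch: build a small demo kwcoco dataset like the "kwcoco toydata" tool does.
# The demo key is an assumption and may differ between kwcoco versions.
import kwcoco

dset = kwcoco.CocoDataset.demo('vidshapes8-multispectral')
print(f'{len(dset.index.videos)} videos, '
      f'{len(dset.index.imgs)} images, '
      f'{len(dset.index.anns)} annotations')

# Each image is a plain dictionary; multispectral demo images carry their
# bands as auxiliary/asset entries with channel codes.
img = dset.index.imgs[1]
for asset in img.get('auxiliary', img.get('assets', [])):
    print(asset['channels'], asset['file_name'])
```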
## MLOps

To evaluate our systems over a parameter grid, we've written an mlops system to define prediction pipelines and run them over a grid of parameters (using a GitHub-Actions-like YAML configuration). The basic pipeline structure has the user define the paths a process is expected to take as inputs and produce as outputs. Outputs of one process can be connected as inputs of another without the user needing to manually specify them; only unconnected inputs and non-default configuration variables must be given. The user specifies the relative name for each output file, but the mlops system chooses the directory the outputs will be written to. It does this using a hashed directory structure, which lets it determine whether a process has completed and ensures that a change in the pipeline configuration only causes the affected results to be recomputed. To make navigation of this directory structure easier, each node's output folder is equipped with symlinks to the predecessor nodes it depends on as well as the successor nodes that depend on it.

The system assumes that all processes are invokable as a bash script (i.e., there is a CLI for each operation a user might want), which is a key design decision. This allows the mlops system to only be concerned with generating the right bash invocations to run a pipeline. In each output node we write an "invoke.sh" script which provides the bash invocation used to compute the node's results. This has been instrumental when debugging. The bash-script assumption also means that we can abstract how a pipeline or DAG is run. We do this via the cmd_queue module. To use this module the user creates a queue and then submits jobs as bash command strings, along with references to the jobs they depend on. The actual execution of the jobs is abstracted by one of three (perhaps soon to be four) backends:
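To make the cmd_queue pattern concrete, here is a minimal sketch based on my reading of the cmd_queue documentation; the exact argument names (e.g. `backend='serial'`, `depends=...`) are assumptions and may differ by version.

```python
# Minimal sketch of the cmd_queue usage pattern described above.
# Backend name and keyword arguments are assumptions about the API.
import cmd_queue

queue = cmd_queue.Queue.create(backend='serial')

# Each submission is just a bash command string.
job1 = queue.submit('echo "stage 1: predict heatmaps"')
job2 = queue.submit('echo "stage 2: extract polygons"', depends=[job1])

# The queue takes care of running the jobs in dependency order.
queue.run()
```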
## Relationship to RasterVision

Our dataloader automatically computes the mean/std of the input dataset as well as class frequencies. This seems similar to the "ANALYZE" step in RasterVision. Something GeoWATCH does not yet do is allow the user to specify the mean/std or frequency statistics so training is not forced to compute them.

Our virtual sample grid seems to correspond to "CHIP" in the RasterVision pipeline. RasterVision's direct sampling seems to correspond to what we can do with ndsampler. We are going to run some tests to further compare them and see if one is faster than the other. GeoWATCH doesn't have the ability to pre-chip data, but if you can afford the preprocessing it will likely be faster than sampling directly from COGs, although it does limit the translation augmentation the dataloader can do.

For the "TRAIN" step it seems like both frameworks settled on Lightning, so porting our callbacks for use in RasterVision shouldn't be too hard. Something that is nice about how geowatch invokes LightningCLI is that it can specify the entire config inline in bash. Our tutorial 1 shows an example of this; it requires a small hack to make it work. RasterVision uses pydantic for configuration, whereas we use scriptconfig (a less popular but more flexible tool). This also requires some monkeypatches on top of jsonargparse, but my hope is that I can upstream some of those changes so both pydantic- and scriptconfig-based configs can be used.

For PREDICT, it seems both frameworks have similar strategies of incrementally stitching together heatmap predictions from batches. For vector outputs such as bounding boxes, the main GeoWATCH fusion tool doesn't produce them yet, but this is in development and will work similarly to our implementation of a DINO box predictor, where detections are accumulated and non-max suppressed. Note that our implementation of non-max suppression and other efficient annotation data structures are provided by a standalone library, kwimage. Something we've strived for in building these tools is to modularize them into separate Python modules with fewer dependencies, so it is easier to re-use or re-purpose them in other libraries.

For EVAL, we have object detection and pixelwise segmentation metrics, as well as the official metrics code provided to us by IARPA. Currently the object detection metrics live in kwcoco, and the plan is to port the pixelwise segmentation metrics there as well. A good deal of work has gone into making them efficient, so it will be interesting to compare implementations.

For BUNDLE, it looks like both frameworks again have similar solutions. I'm glad others have realized how important this is. We use torch.package to bundle the code and the weights. One tweak we needed to make is to include a package header so the predict script knows the name of the module that is packaged.
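To illustrate the general torch.package bundling idea (including the "header" trick mentioned above), here is a rough sketch; the module/resource names and header contents are hypothetical, not GeoWATCH's actual layout.

```python
# Rough sketch of the torch.package bundling pattern described above.
# Resource names and the header layout are hypothetical.
from torch import nn
from torch.package import PackageExporter, PackageImporter

model = nn.Linear(4, 2)  # stand-in for a trained fusion model

# Export: bundle code + weights, plus a small header so the loader
# knows which package/pickle to pull out of the archive.
with PackageExporter('model_bundle.pt') as exp:
    exp.extern(['torch', 'torch.**'])  # rely on the consumer's torch install
    exp.save_pickle('package_header', 'header.pkl',
                    {'module_name': 'my_model', 'arch': 'linear-demo'})
    exp.save_pickle('my_model', 'model.pkl', model)

# Import: read the header first, then load the model it points to.
imp = PackageImporter('model_bundle.pt')
header = imp.load_pickle('package_header', 'header.pkl')
reloaded = imp.load_pickle(header['module_name'], 'model.pkl')
print(header, reloaded)
```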
-
Hi, since this is something that had come up in our discussion, I wanted to give an update: the new v0.30 release no longer requires exact versions of dependencies. (They're still there in the
-
I'm attempting to understand RasterVision, and I'm curious how it handles the chipping of images. Does it write the smaller images to disk, or does it leverage COGs to subsample from larger images?
I've been working on a similar open source library (geowatch), and I'm wondering how much common ground we've covered.
EDIT: I've changed the name of this thread from "RasterVision Internals - Is chipping done on disk?" to "RasterVision Comparison To GeoWATCH".