GeoJSON schemas generation #1349

Closed
wants to merge 12 commits into from

MTachon
Contributor

@MTachon MTachon commented Sep 11, 2023

Overview

This PR builds upon PR #1022 and aims to give providers that publish GeoJSON data a standard way to generate schemas. It uses pydantic models, from which the corresponding JSON schemas can be generated. With custom validator functions, pydantic models can also validate incoming data (for transactions) beyond what JSON schemas allow for (e.g. closed linear rings for a valid GeoJSON Polygon, invalid geometry checks, ...).

Several helper functions are implemented, which can be used in the get_schema() method of providers:

pygeoapi/models/geojson.py:

  • create_geojson_geometry_model(): creates a pydantic model for a GeoJSON Geometry
  • create_geojson_feature_model(): creates a pydantic model for a GeoJSON Feature
  • create_geojson_feature_collection_model(): creates a pydantic model for a GeoJSON FeatureCollection

NOTE: These helper functions return pydantic models that come with default validator functions. These validator functions (defined in pygeoapi/models/validators.py) check that every geometry of type Polygon in a GeoJSON Geometry/Feature/FeatureCollection has closed linear rings. They run when the model_validate() method of the returned pydantic model is called, and can be overridden with the field_validators parameter of the create_geojson_geometry_model(), create_geojson_feature_model() and create_geojson_feature_collection_model() functions.
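
As a rough illustration of what the closed-ring check amounts to, here is a minimal, dependency-free sketch. The names `ring_is_closed` and `polygon_rings_closed` are hypothetical and are not taken from pygeoapi/models/validators.py; in the PR, a comparable check would presumably run inside a pydantic validator.

```python
from typing import Sequence

Position = Sequence[float]


def ring_is_closed(ring: Sequence[Position]) -> bool:
    """Return True if the ring has at least 4 positions and its first
    and last positions are identical (RFC 7946, section 3.1.6)."""
    return len(ring) >= 4 and list(ring[0]) == list(ring[-1])


def polygon_rings_closed(geometry: dict) -> bool:
    """Check every ring of a GeoJSON Polygon (hypothetical helper)."""
    if geometry.get('type') != 'Polygon':
        return True  # nothing to check for other geometry types
    return all(ring_is_closed(ring) for ring in geometry['coordinates'])


closed = {
    'type': 'Polygon',
    'coordinates': [[[0, 0], [1, 0], [1, 1], [0, 0]]],
}
unclosed = {
    'type': 'Polygon',
    'coordinates': [[[0, 0], [1, 0], [1, 1], [0, 1]]],
}
print(polygon_rings_closed(closed), polygon_rings_closed(unclosed))
```

A check like this is exactly the kind of constraint that a JSON schema alone cannot express, since it compares two values inside the same array.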

pygeoapi/schemas.py:

  • get_geojson_feature_schema(): creates the JSON schema for GeoJSON Feature
  • get_geojson_feature_collection_schema(): creates the JSON schema for GeoJSON FeatureCollection

NOTE: These are shorthand functions that create JSON schemas directly from pydantic models. They build the appropriate pydantic model by calling one of the functions from pygeoapi/models/geojson.py, then call its model_json_schema() method. During schema generation, default values are removed for the bbox and id properties.
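
To make the last point concrete, the sketch below removes 'default' entries for bbox and id from an already generated schema dict. This is only an illustration: `drop_defaults` is a hypothetical helper, the schema dict is hand-written, and the PR itself handles this during generation by subclassing pydantic.json_schema.GenerateJsonSchema rather than by post-processing.

```python
import copy


def drop_defaults(schema: dict, names=('bbox', 'id')) -> dict:
    """Return a copy of `schema` without 'default' entries for the given
    top-level properties (hypothetical helper, not the PR's code)."""
    cleaned = copy.deepcopy(schema)
    for name in names:
        cleaned.get('properties', {}).get(name, {}).pop('default', None)
    return cleaned


# Hand-written, simplified Feature schema used only for this example
feature_schema = {
    'type': 'object',
    'properties': {
        'type': {'const': 'Feature'},
        'id': {'type': ['string', 'integer'], 'default': None},
        'bbox': {'type': 'array', 'default': None},
    },
}
cleaned = drop_defaults(feature_schema)
```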

The following code shows how vector data providers can implement their get_schema() method:

import datetime as dt

from pygeoapi.models.geojson import GeoJSONProperty
from pygeoapi.provider.base import BaseProvider
from pygeoapi.schemas import get_geojson_feature_schema


class MyProvider(BaseProvider):
    ...
    def get_schema(self):
        # Getting the fields here
        ...

        # Defining a list of GeoJSONProperty for the data published by the provider,
        # based on the field names and types obtained above
        # Note the use of the 'nullable' and 'required' parameters
        # The argument passed to 'dtype' must be a type supported by pydantic and JSON serializable
        geojson_properties = [
            GeoJSONProperty(
                name='city', dtype=str, nullable=False, required=True,
            ),
            GeoJSONProperty(
                name='population', dtype=int, nullable=False, required=False,
            ),
            GeoJSONProperty(
                name='area', dtype=float, nullable=True, required=True,
            ),
            GeoJSONProperty(
                name='db_datetime', dtype=dt.datetime, nullable=True, required=False,
            ),
        ]

        # Generate the JSON schema
        json_schema = get_geojson_feature_schema(
            properties=geojson_properties,
            geom_type='Point',
            # True/False affects how features will validate against the schema,
            # whether 'geometry' can be set to 'null' or not
            geom_nullable=False,
            n_dims=2,  # number of dimensions for 'bbox' and 'coordinates' fields
        )

        # (media type, JSON schema) pair
        return ('application/geo+json', json_schema)

Ideally, the get_fields() method of providers could be extended so that it returns the list of GeoJSONProperty directly, or a custom get_geojson_properties() method could be used instead. Either should take care of the nullable and required parameters.

This PR also opens the door to defining a get_data_model(type_: Literal['Feature', 'FeatureCollection']) abstract method in BaseProvider, to be implemented by providers. get_data_model() would call either create_geojson_feature_model() or create_geojson_feature_collection_model() and return the appropriate pydantic model. To support feature transactions, the model's model_validate() method can be called directly in pygeoapi.api.manage_collection_item() to accept or reject incoming data. As mentioned above, validation with pydantic models is more flexible and powerful than validation against JSON schemas.

Related Issue / Discussion

Additional Information

  • The implementation uses pydantic v2 syntax. The migration to pydantic v2 was implemented in PR #1353 (Refactor code base to make it work with pydantic v2).
  • Using geojson-pydantic was considered. It does not seem to play well with pydantic v2 right now. In addition, dumping the JSON schema of a geojson_pydantic.Feature instance results in a valid JSON schema for a GeoJSON Feature in general. Such a schema cannot be used if we want, for example, to constrain the geometry to a specific type (e.g. Point) and let end users know through the OpenAPI document which geometry type is expected. If this changes and we are willing to add another dependency to pygeoapi, we may refactor the code to use geojson-pydantic.

Contributions and Licensing

(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)

  • I'd like to contribute [feature X|bugfix Y|docs|something else] to pygeoapi. I confirm that my contributions to pygeoapi will be compatible with the pygeoapi license guidelines at the time of contribution.
  • I have already previously agreed to the pygeoapi Contributions and Licensing Guidelines

@MTachon MTachon marked this pull request as draft September 11, 2023 19:31
@MTachon MTachon changed the title [WIP] GeoJSON schemas generation GeoJSON schemas generation Sep 20, 2023
@MTachon MTachon marked this pull request as ready for review September 20, 2023 11:24
Contributor

@francbartoli francbartoli left a comment

Hi @MTachon, I like the idea, but I'd rather you move the changes for pydantic 2 into a separate PR. This would make the PR cleaner and keep it within the scope of its title.

Contributor Author

@MTachon MTachon left a comment

The changes related to the migration to pydantic V2 are now moved to PR #1353

@francbartoli
Contributor

Hi @MTachon, thanks for the move. A rebase onto the master branch would have been better and cleaner than a merge. Let's wait for the review from @tomkralidis

@MTachon
Contributor Author

MTachon commented Sep 22, 2023

Will clean up the branch.

@MTachon MTachon marked this pull request as draft September 22, 2023 19:22
@MTachon MTachon marked this pull request as ready for review September 23, 2023 10:14
@MTachon
Contributor Author

MTachon commented Sep 23, 2023

Force pushed the rebased branch.

@francbartoli
Contributor

+1, let's wait for @tomkralidis's review

Use 'None' as default value for the 'bbox' property, as typing.Optional is not
JSON serializable.

Use Literal[None] instead of Optional[None] for 'geometry' and 'properties'
properties, when the create_geojson_feature* functions are called with their
'geom_type' and 'properties' parameters set to 'None', respectively.

Subclass pydantic.json_schema.GenerateJsonSchema, and take care of removing
'default' values for 'bbox' and 'id' properties when generating JSON schema for
GeoJSON Feature and GeoJSON FeatureCollection.
Contributor

@francbartoli francbartoli left a comment

Hi @MTachon, can we move this to a plugin? (cc @tomkralidis)

Something like pygeoapi/provider/vector_data_validator.py or similar

@MTachon
Contributor Author

MTachon commented Oct 4, 2023

@francbartoli and @tomkralidis ,

Some reflections after discussing with @francbartoli:

  • This PR does not implement anything in data providers that uses the data validation/schema generation functions, nor does it affect the pygeoapi core implementation.
  • We may use the new modules from this PR as a plugin for later provider implementations. The list of dependencies in requirements.txt is unchanged in this PR.
  • If we want to move this to a plugin, could the dependencies needed by this PR be listed in requirements-provider.txt instead? Installing the dependencies from requirements-provider.txt would then be enough to use this PR's helper functions in provider implementations (e.g. get_schema()). If we want to use functionality from this PR in core (e.g. pygeoapi/api.py or pygeoapi/openapi.py), we could catch NotImplementedError exceptions in try/except blocks for the providers that do not implement the get_schema() or get_data_model() abstract methods. What do you think? We may also want to implement a more sophisticated dependency injection pattern to keep pygeoapi core as abstract as possible.
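
The try/except idea from the last bullet could look roughly like the sketch below. The classes are simplified stand-ins written for this example (they are not pygeoapi's actual BaseProvider), and `schema_or_none` is a hypothetical helper name.

```python
class BaseProvider:
    """Stand-in for pygeoapi's base provider (simplified for this sketch)."""

    def get_schema(self):
        raise NotImplementedError()


class PlainProvider(BaseProvider):
    """Provider that does not implement get_schema()."""


class SchemaProvider(BaseProvider):
    """Provider that returns a (media type, schema) pair."""

    def get_schema(self):
        return 'application/geo+json', {'type': 'object'}


def schema_or_none(provider):
    """Return the provider's schema pair, or None if it is not implemented."""
    try:
        return provider.get_schema()
    except NotImplementedError:
        return None


print(schema_or_none(PlainProvider()))
print(schema_or_none(SchemaProvider()))
```

With this pattern, core code stays usable for providers that never opt in to schema generation, and no pydantic import is needed in core.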

@tomkralidis
Member

Some comments:

  • for schema generation, are we able to use a provider's get_fields() function and build a valid JSON schema by building up a Python dict and dumping it?
  • for validation, are we able to use jsonschema (which is an IETF standard) to validate against, say, https://geojson.org/schema/Feature.json?
  • for geometry validation, I would strongly caution against implementing anything that is already supported by Shapely (for example, any Shapely geometry object can use is_valid to verify geometry correctness)

In summary, I think the above can be realized without using pydantic, in the interest of robust and long term sustainability. If this is not possible, this could also be added as a 'validating' provider that other providers can inherit from if they so choose.
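
For reference, the dict-based approach suggested in the first bullet might be sketched as follows. The shape of the `fields` mapping is an assumption about what a provider's get_fields() returns, `feature_schema_from_fields` is a hypothetical helper, and the resulting schema is deliberately simplified (no bbox, no id, no nullable handling).

```python
import json


def feature_schema_from_fields(fields: dict, geom_type: str = 'Point') -> dict:
    """Build a simplified JSON schema for a GeoJSON Feature from a
    mapping of field names to JSON schema fragments."""
    return {
        '$schema': 'https://json-schema.org/draft/2020-12/schema',
        'type': 'object',
        'required': ['type', 'geometry', 'properties'],
        'properties': {
            'type': {'const': 'Feature'},
            'geometry': {
                'type': 'object',
                'required': ['type', 'coordinates'],
                'properties': {
                    'type': {'const': geom_type},
                    'coordinates': {'type': 'array'},
                },
            },
            'properties': {
                'type': 'object',
                'properties': dict(fields),
            },
        },
    }


# Assumed shape of a provider's get_fields() output
fields = {
    'city': {'type': 'string'},
    'population': {'type': 'integer'},
    'db_datetime': {'type': 'string', 'format': 'date-time'},
}
schema = feature_schema_from_fields(fields)
print(json.dumps(schema, indent=2))
```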

@MTachon
Contributor Author

MTachon commented Oct 4, 2023

Thanks for the feedback @tomkralidis !

  • For schema generation, it is certainly possible to use a provider's get_fields() method and build a Python dict from it inside get_schema(). What I intended with this PR is to implement helper functions for building JSON schemas (dict instances) that can be reused across provider implementations. I actually started implementing this in PR #1266 ([WIP] OGC API Features - Part 4 / Support for PostgreSQLProvider). I then thought it would be better to isolate the schema generation part in a separate PR, this one. In PR #1266, I started implementing helper functions in pygeoapi/schemas.py by building Python dict instances myself. As you can see, it gets rather verbose, considering that it only generates JSON schemas, which allow for data validation at the schema/structural level only, and not at the semantic level (more about this further down in the comment).

    In addition, the implementation in PR #1266 ([WIP] OGC API Features - Part 4 / Support for PostgreSQLProvider) would require a mapping between Python types and JSON types in order to instantiate GeoJSONProperty dataclass objects. A similar approach is used in the get_fields() method in pygeoapi/provider/postgresql.py, from merged PR #1312 (Make sure that PostgreSQLProvider.get_fields returns valid json schem…). With pydantic, you get this for free: pydantic maps the Python types directly and generates the correct JSON schema. Taking PostgreSQLProvider as an example again, for a PostgreSQL table with columns of date/time types, if we create a pydantic model with fields of datetime.datetime or datetime.time data types, pydantic will translate the Python data types to the "type": "string" type constraint in the generated JSON schema. In addition to the type constraint, it will also add "format": "date"/"format": "date-time"/"format": "time" annotations, which are useful for documenting a schema and can be used for more accurate validation. "format" annotations and other kinds of data constraints are supported for a wide range of data types.

  • jsonschema is a great and lightweight library for data validation against JSON schemas. However, as JSON schemas cannot contain arbitrary code, some data constraints cannot be expressed, such as the closed linear rings case mentioned earlier. pydantic validators can help with such validation. Again, for data validation at the schema level only, jsonschema is great.

  • When it comes to geometry validation, I totally agree that validator functions should not try to implement anything complex. The default validator functions actually only check for valid GeoJSON Geometry (e.g. closed rings for polygons). If the aim is to further validate geometries (e.g. checking for self-intersecting polygons), we should indeed make use of the is_valid predicate of shapely geometrical objects, and let the GEOS geometry engine do the hard work.
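
To illustrate the type mapping discussed in the first bullet, this is roughly what a hand-maintained mapping from Python types to JSON schema fragments looks like without pydantic. `PY_TO_JSON` and `json_type` are hypothetical names for this sketch, and the way nullability is expressed here ('anyOf' with a null type) is one possible choice, not necessarily what the PR does.

```python
import datetime as dt

# Hand-written mapping of the kind a library would otherwise provide
PY_TO_JSON = {
    str: {'type': 'string'},
    int: {'type': 'integer'},
    float: {'type': 'number'},
    bool: {'type': 'boolean'},
    dt.date: {'type': 'string', 'format': 'date'},
    dt.datetime: {'type': 'string', 'format': 'date-time'},
    dt.time: {'type': 'string', 'format': 'time'},
}


def json_type(dtype: type, nullable: bool = False) -> dict:
    """Translate a Python type into a JSON schema fragment."""
    fragment = dict(PY_TO_JSON[dtype])  # copy so callers can mutate safely
    if nullable:
        fragment = {'anyOf': [fragment, {'type': 'null'}]}
    return fragment


print(json_type(dt.datetime))
print(json_type(float, nullable=True))
```

Every new column type needs a new entry in such a table, which is the maintenance cost weighed against the extra dependency in this discussion.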

I am sure this could be implemented without pydantic. But I think this is a trade-off between not relying on an additional external dependency on the one hand, and simpler/more compact implementation and more flexible/powerful data validation on the other hand, IMHO.

@MTachon
Contributor Author

MTachon commented Oct 4, 2023

I see that I am not checking for clockwise vs. counterclockwise direction for linear rings, as mentioned in https://www.rfc-editor.org/rfc/rfc7946#section-3.1.6, in the default validator functions. I guess I am better off with using shapely's validation functions.
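
For context, the winding-order check mentioned here can be sketched with the shoelace formula, without any geometry library. The function names are hypothetical, and the sketch assumes closed rings and planar x/y coordinates (any extra dimensions are ignored). RFC 7946 section 3.1.6 asks for counterclockwise exterior rings and clockwise holes.

```python
def signed_area(ring):
    """Shoelace formula on a closed ring: positive when the ring is
    counterclockwise, negative when it is clockwise."""
    total = 0.0
    for p, q in zip(ring, ring[1:]):
        total += p[0] * q[1] - q[0] * p[1]
    return total / 2.0


def follows_right_hand_rule(polygon_coords):
    """True if the exterior ring is counterclockwise and all holes
    are clockwise (hypothetical helper)."""
    exterior, *holes = polygon_coords
    return signed_area(exterior) > 0 and all(
        signed_area(hole) < 0 for hole in holes
    )


ccw_square = [[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]
cw_square = list(reversed(ccw_square))
print(follows_right_hand_rule([ccw_square]))
print(follows_right_hand_rule([cw_square]))
```

Note that a ring with the wrong winding order is still a valid geometry from a topological point of view, so this is a GeoJSON-specific check rather than a general validity test.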

As per RFC4, this Pull Request has been inactive for 90 days. In order to manage maintenance burden, it will be automatically closed in 7 days.

@github-actions github-actions bot added the stale Issue marked stale by stale-bot label Mar 10, 2024
As per RFC4, this Pull Request has been closed due to there being no activity for more than 90 days.

@github-actions github-actions bot closed this Mar 24, 2024
Labels
stale Issue marked stale by stale-bot
3 participants