Add one-pager for Terraform registry scrapers and registry metadata
enhanced Terrajet codegen pipelines

- Fixes crossplane/terrajet/issues/203

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
ulucinar committed Mar 10, 2022
1 parent 6494953 commit 99a70b1
Showing 1 changed file with 272 additions and 0 deletions.
272 changes: 272 additions & 0 deletions design/one-pager-terrajet-metadata-extraction.md
@@ -0,0 +1,272 @@
# Metadata Extraction from Terraform Registry for Terrajet-based providers
* Owner: Alper Rifat Uluçınar (@ulucinar)
* Reviewers: Crossplane Maintainers
* Status: Draft

### Background
For providers generated using [Terrajet], the number of managed resources can
exceed [several hundred][provider-jet-aws-preview], and especially for the big
three Terrajet-based providers ([provider-jet-aws], [provider-jet-gcp] and
[provider-jet-azure]), it's very inconvenient and time-consuming to manually
author example manifests for all those generated resources. The convention we
have adopted so far is to manually add example manifests for the resources we
explicitly configure in their respective pull requests.

Another dimension we need to consider is that currently we are lacking API
documentation for the generated resources. Although it's possible to redirect
users of those APIs to the [Terraform registry], it's desirable to have the
documentation generated together with the API (as comments on the associated
`struct`s and fields), and have them published on `doc.crds.dev`.

There is also a wealth of metadata that we can use to enrich the Terrajet-based
providers and the generated resources such as category names for the Terraform
resources. For example, a provider implementation may opt to use the category
names to group respective APIs. Or the examples provided in the Terraform
registry hint at reference fields that can help us in auto-generating
cross-resource references (that appear in those HCL configurations).

While working on generating example manifests for the big three Terrajet-based
providers `provider-jet-aws`, `provider-jet-gcp` and `provider-jet-azure` in the
context of the corresponding [Terrajet issue #48], we have seen utility in
extracting such metadata from the Terraform registry and using it to generate
example manifests and documentation. In this document, we would like to propose:
- A metadata format that we can optionally use in the Terrajet-based provider
repositories to generate example manifests, documentation, etc.,
- A concept of metadata extractors from Terraform registry and potentially from
other sources for Terrajet-based providers,
- A new Terrajet codegen pipeline to generate example manifests, which can
optionally be invoked in Terrajet-based providers during code generation,
accepting the scraped metadata from the Terraform registry.
- Extension of existing Terrajet codegen pipelines to also generate
documentation on `struct`s and fields.

### Goals
We would like to achieve the following goals with this proposal:
- The proposed new pipeline(s) or extensions of the existing code generation
pipelines must be optional. If, for example, an example generation pipeline is
not configured in a Terrajet-based provider repo, or if the already existing
code generation pipeline is not configured to also generate documentation,
then the behavior of the configured pipelines should not change. Thus,
configuration of the new pipelines or enhancement of existing ones with
registry-scraped metadata should be optional.
- Like existing Terrajet pipelines, newly added registry metadata based
pipelines should be stable, i.e., running them on the same metadata must
always produce the same output. Similarly, any extension of the existing
pipelines with registry metadata must preserve their stability.
- We would like to have means of correcting/adjusting scraped metadata before
it's input to the codegen pipelines. This would allow us to make manual
corrections/enhancements on the output of a scraper, or even manually craft
complete or semi-complete registry metadata documents, if for example the
provider is small (in the number of resources it supports), and an automatic
scraper is not immediately available. This will also allow us, if needed, to
have different scrapers that produce output in the same metadata format. For
instance, we may have a relatively complex scraper for extracting metadata
from the Terraform registry, and another relatively simple one that just adds
example HCL configurations by reading them from their respective
[files][aws-example-configurations]. Different scraper implementations can
then fetch metadata from different sources, while the Terrajet pipelines always
work on a well-defined format regardless of how the metadata is scraped.
- We would like to have the scrapers run as needed, produce their output in the
common metadata format, and to have the metadata documents added to their
respective repositories. The corresponding pipelines can then run on every
`make generate`, just like the existing codegen pipelines do. This separates
the lifecycle of metadata scraping from that of code generation.
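
The stability goal has a practical consequence for pipelines written in Go:
map iteration order is not specified, so a pipeline that ranges directly over
the `resources` map could emit its output in a different order on each run. A
minimal sketch of one way to avoid this is shown below. The helper name
`sortedKeys` is hypothetical and not part of Terrajet.

```go
package main

import "sort"

// sortedKeys returns the keys of m in lexical order. It is a hypothetical
// helper: ranging over the returned slice instead of the map itself gives a
// pipeline the same traversal order on every run.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

A pipeline would then iterate with `for _, name := range sortedKeys(resources)`
and look up each resource's metadata by `name`.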

### Metadata Format
The proposed syntax for scraped metadata documents is YAML as we would also like
the metadata to be human readable, searchable and maintainable, if needed. A
concrete example of a scraped registry metadata document for a resource named
`azurerm_analysis_services_server` of the native Terraform provider
[terraform-provider-azurerm] could be as follows:

```yaml
# Terraform native provider name
name: hashicorp/terraform-provider-azurerm
# map from Terraform native resource names to scraped resource metadata
resources:
  # a Terraform native resource name defined in the provider
  azurerm_analysis_services_server:
    # sub-category metadata for the resource extracted from Terraform registry, if available.
    # Candidate to be used as API group names in the generated Terrajet provider, if desired.
    subCategory: Analysis Services
    # description for the resource extracted from Terraform registry, if available.
    # Candidate to be used as the CRD type documentation
    description: Manages an Analysis Services Server.
    # title for the resource as it appears in the registry.
    titleName: azurerm_analysis_services_server
    # Array of example HCL configurations available for the Terraform resource.
    # Terraform registry contains examples but there can be other sources as well.
    examples:
      # example configuration in HCL syntax
      - manifest: |-
          {
            "admin_users": [
              "[email protected]"
            ],
            "enable_power_bi_service": true,
            "ipv4_firewall_rule": [
              {
                "name": "myRule1",
                "range_end": "210.117.252.255",
                "range_start": "210.117.252.0"
              }
            ],
            "location": "northeurope",
            "name": "analysisservicesserver",
            "resource_group_name": "${azurerm_resource_group.rg.name}",
            "sku": "S0",
            "tags": {
              "abc": 123
            }
          }
        # reference parameters extracted from Terraform registry examples
        # map from referer parameter names to referee <target resource type>.<target field>
        references:
          # for example, "azurerm_analysis_services_server" has a parameter
          # named "resource_group_name" that refers to a "azurerm_resource_group"'s
          # "name" parameter
          # Candidate for auto-generating cross-resource references
          resource_group_name: azurerm_resource_group.name
    # scraped Terraform registry docs for the parameters and attributes of the resource
    argumentDocs:
      # parameters with non-block values map directly to doc strings
      admin_users: '- (Optional) List of email addresses of admin users.'
      backup_blob_container_uri: '- (Optional) URI and SAS token for a blob container to store backups.'
      enable_power_bi_service: '- (Optional) Indicates if the Power BI service is allowed to access or not.'
      # exported attributes appear under the "exportedAttributes" map (as a block)
      exportedAttributes:
        id: '- The ID of the Analysis Services Server.'
        server_full_name: '- The full name of the Analysis Services Server.'
      # parameters with block values are maps
      ipv4_firewall_rule:
        name: '- (Required) Specifies the name of the firewall rule.'
        # if the block-valued parameter has itself a description, it appears under "nodeText"
        # We assume "nodeText" is not a valid parameter/attribute name
        nodeText: '- (Optional) One or more ipv4_firewall_rule block(s) as defined below.'
        range_end: '- (Required) End of the firewall rule range as IPv4 address.'
        range_start: '- (Required) Start of the firewall rule range as IPv4 address.'
      location: '- (Required) The Azure location where the Analysis Services Server exists. Changing this forces a new resource to be created.'
      name: '- (Required) The name of the Analysis Services Server. Changing this forces a new resource to be created.'
      querypool_connection_mode: '- (Optional) Controls how the read-write server is used in the query pool. If this value is set to All then read-write servers are also used for queries. Otherwise with ReadOnly these servers do not participate in query operations.'
      resource_group_name: '- (Required) The name of the Resource Group in which the Analysis Services Server should be exist. Changing this forces a new resource to be created.'
      sku: '- (Required) SKU for the Analysis Services Server. Possible values are: D1, B1, B2, S0, S1, S2, S4, S8, S9, S8v2 and S9v2.'
      timeouts:
        create: '- (Defaults to 30 minutes) Used when creating the Analysis Services Server.'
        delete: '- (Defaults to 30 minutes) Used when deleting the Analysis Services Server.'
        read: '- (Defaults to 5 minutes) Used when retrieving the Analysis Services Server.'
        update: '- (Defaults to 30 minutes) Used when updating the Analysis Services Server.'
    # import statement scraped from the Terraform registry, if available
    # Can be used for advanced purposes, such as constructing resource config "ExternalName.GetIDFn" functions, etc.
    importStatements:
      - terraform import azurerm_analysis_services_server.server /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/resourcegroup1/providers/Microsoft.AnalysisServices/servers/server1
```
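
To illustrate how a `references` entry like the one in the example could be
derived, the following sketch scans the top-level string parameters of an
example manifest for values that consist of a single interpolation of the form
`${<resource type>.<local name>.<attribute>}`. The function name
`extractReferences` and the regular expression are assumptions made for this
illustration. They are not taken from the Terrajet implementation, and nested
blocks are deliberately ignored to keep the sketch short.

```go
package main

import (
	"encoding/json"
	"regexp"
)

// refExpr matches a value that is exactly one interpolation such as
// ${azurerm_resource_group.rg.name}. The pattern is an assumption about how
// references appear in the scraped examples.
var refExpr = regexp.MustCompile(`^\$\{([a-z0-9_]+)\.[A-Za-z0-9_-]+\.([a-z0-9_]+)\}$`)

// extractReferences (hypothetical) parses a JSON example manifest and maps
// each top-level referer parameter name to
// "<target resource type>.<target field>".
func extractReferences(manifest string) (map[string]string, error) {
	var params map[string]any
	if err := json.Unmarshal([]byte(manifest), &params); err != nil {
		return nil, err
	}
	refs := map[string]string{}
	for name, v := range params {
		s, ok := v.(string)
		if !ok {
			continue // only plain string values can be references here
		}
		if m := refExpr.FindStringSubmatch(s); m != nil {
			refs[name] = m[1] + "." + m[2]
		}
	}
	return refs, nil
}
```

For the manifest above, this would yield a single entry that maps
`resource_group_name` to `azurerm_resource_group.name`.
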
Another alternative could be to have qualified names under the `argumentDocs`
with a flat hierarchy, e.g., instead of a nested `ipv4_firewall_rule` block
represented as a map, we could have its block parameters qualified with the
configuration block name (`ipv4_firewall_rule.range_start`,
`ipv4_firewall_rule.range_end`, etc.). Then `argumentDocs` would become a simple
`map[string]string`.
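
As a rough sketch of how the two representations relate, the nested form can be
converted into the flat one with a short recursive walk. The function names are
hypothetical. Keying a block's own `nodeText` description by the block name
itself is also an assumption of this sketch, not something this proposal fixes.

```go
package main

// flattenArgumentDocs (hypothetical) converts nested argumentDocs into a
// flat map keyed by dot-qualified parameter names.
func flattenArgumentDocs(docs map[string]any) map[string]string {
	flat := map[string]string{}
	flattenInto(flat, "", docs)
	return flat
}

// flattenInto walks docs recursively and writes every doc string it finds
// into flat under its qualified name.
func flattenInto(flat map[string]string, prefix string, docs map[string]any) {
	for name, v := range docs {
		key := name
		switch {
		case prefix != "" && name == "nodeText":
			// Assumption: a block's own description is stored under the
			// block's qualified name rather than "<block>.nodeText".
			key = prefix
		case prefix != "":
			key = prefix + "." + name
		}
		switch val := v.(type) {
		case string:
			flat[key] = val
		case map[string]any:
			flattenInto(flat, key, val)
		}
	}
}
```

With the example document, `ipv4_firewall_rule.range_start` and
`timeouts.create` would become top-level keys of the resulting map.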

Another alternative could be to have per-resource YAML metadata files, i.e.,
instead of the `resources` map in a single file, we could have each of its keys
(and associated metadata) stored in a resource specific YAML-formatted file.
These resource specific files could each be named as `<Terraform resource
type>.yaml`, e.g., `azurerm_analysis_services_server.yaml`.

### Metadata scrapers
Although we have not validated this against all available Terraform providers,
at least the big three Terraform providers ([terraform-provider-aws],
[terraform-provider-azurerm] and [terraform-provider-google]) keep their
Terraform registry content in their respective repositories as markdown
documents with a common structure. Our assumption is that the Terraform registry
website is also generated from these markdown files:
- For `terraform-provider-aws`:
https://github.com/hashicorp/terraform-provider-aws/tree/main/website/docs/r
- For `terraform-provider-azurerm`:
https://github.com/hashicorp/terraform-provider-azurerm/tree/main/website/docs/r
- For `terraform-provider-google`:
https://github.com/hashicorp/terraform-provider-google/tree/main/website/docs/r

Thus a common metadata scraper implementation can extract metadata from these
well-formatted per-resource markdown documents. Any spotted errors can then
potentially be corrected manually in the scraped YAML metadata document.
Scrapers can optionally be chained: If desired, another scraper can append
example HCL configurations read from a different source (such as the `examples`
folder found in some of the Terraform native provider repositories as discussed
above).
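
Scraper chaining could be sketched as follows. All type and function names
here (`ProviderMetadata`, `ResourceMetadata`, `Example`, `Scraper`,
`ScraperFunc`, `Chain`) are hypothetical and only illustrate the idea. The
metadata types are trimmed-down stand-ins for the format proposed above.

```go
package main

// Example is a trimmed-down stand-in for an entry of the "examples" array.
type Example struct {
	Manifest string
}

// ResourceMetadata is a trimmed-down stand-in for per-resource metadata.
type ResourceMetadata struct {
	Description string
	Examples    []Example
}

// ProviderMetadata is a stand-in for a whole metadata document.
type ProviderMetadata struct {
	Name      string
	Resources map[string]*ResourceMetadata
}

// Scraper enriches a metadata document from some source.
type Scraper interface {
	Scrape(md *ProviderMetadata) error
}

// ScraperFunc adapts a plain function to the Scraper interface.
type ScraperFunc func(md *ProviderMetadata) error

// Scrape calls f.
func (f ScraperFunc) Scrape(md *ProviderMetadata) error { return f(md) }

// Chain returns a Scraper that runs the given scrapers in order on the same
// document and stops at the first error.
func Chain(scrapers ...Scraper) Scraper {
	return ScraperFunc(func(md *ProviderMetadata) error {
		for _, s := range scrapers {
			if err := s.Scrape(md); err != nil {
				return err
			}
		}
		return nil
	})
}
```

Under these assumptions, a registry scraper and an `examples` folder scraper
would be combined as `Chain(registryScraper, examplesScraper).Scrape(md)`, with
the second scraper appending to the `Examples` of resources that the first one
has already populated.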

For most Terraform native providers, we anticipate that Terraform registry
scrapers will **not** need to run over HTTP, as the resource markdown files are
part of their corresponding provider repositories. The scrapers can just read
those markdown files from a directory in the local filesystem, which is
specified, for instance, as a command-line argument.
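
A minimal sketch of such a local scraper front end is given below. It assumes
that the resource documents end with the `.markdown` suffix and start with a
front matter block delimited by `---` lines that carries keys such as
`subcategory`, `page_title` and `description`. Both are assumptions about the
provider repositories rather than guarantees, and the function names are
hypothetical. The front matter is parsed with a simple line-based reader
instead of a YAML library to keep the sketch dependency-free, and multi-line
block scalars are joined with spaces.

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// listResourceDocs returns the paths of the markdown documents found
// directly under dir, e.g. a local checkout's website/docs/r directory.
func listResourceDocs(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir) // entries are sorted by file name
	if err != nil {
		return nil, err
	}
	var paths []string
	for _, e := range entries {
		if !e.IsDir() && strings.HasSuffix(e.Name(), ".markdown") {
			paths = append(paths, filepath.Join(dir, e.Name()))
		}
	}
	return paths, nil
}

// parseFrontMatter extracts the top-level key/value pairs of a document's
// front matter. Indented lines are treated as the continuation of a
// preceding "key: |-" or "key: |" block scalar.
func parseFrontMatter(doc string) map[string]string {
	fields := map[string]string{}
	lines := strings.Split(doc, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return fields
	}
	blockKey := ""
	for _, line := range lines[1:] {
		if strings.TrimSpace(line) == "---" {
			break
		}
		if strings.HasPrefix(line, " ") || strings.HasPrefix(line, "\t") {
			if blockKey != "" {
				if fields[blockKey] != "" {
					fields[blockKey] += " "
				}
				fields[blockKey] += strings.TrimSpace(line)
			}
			continue
		}
		key, value, found := strings.Cut(line, ":")
		if !found {
			continue
		}
		key, value = strings.TrimSpace(key), strings.TrimSpace(value)
		blockKey = ""
		if value == "|-" || value == "|" {
			blockKey = key
			fields[key] = ""
			continue
		}
		fields[key] = strings.Trim(value, `"`)
	}
	return fields
}
```

A command-line scraper would pass its directory argument to
`listResourceDocs`, read each returned file with `os.ReadFile`, and feed the
contents to `parseFrontMatter` to fill fields such as `subCategory` and
`description`.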

As already indicated, if it turns out that a common registry scraper
implementation is not suitable for a specific Terraform native provider, then a
new scraper can be written as long as it produces metadata output in the
metadata format expected by the Terrajet pipelines. Or even, if the cost of
writing a new scraper is higher than manually authoring the metadata YAML
documents (e.g., the number of resources in the native provider is small), we
can just prepare the metadata YAML(s) by hand, just like the Terraform community
manually maintains the corresponding markdown documents for the Terraform
registry.

### Terrajet Codegen Pipelines Consuming Metadata
As [implemented][terrajet-pr-173] in the context of the [Terrajet issue #48] for
example manifest generation for the big three providers, we could have some
configurable codegen pipelines that consume the YAML resource metadata file(s)
and produce example manifests, CRD documentation, etc. (as discussed above). The
configured pipelines should not fail if the necessary metadata is missing: For
instance, the [example manifest generation pipeline] should simply not generate
an example manifest for a managed resource, if no sample HCL configuration is
available for the corresponding Terraform resource in the metadata. Likewise,
the metadata-enhanced CRD generation pipeline should simply omit any doc
comments that are not available in the corresponding metadata document.
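
The "skip instead of fail" behavior could look roughly like the following
sketch. The `ResourceMetadata` type is a simplified stand-in that uses the flat
`argumentDocs` alternative, and `docComment` and `firstExample` are
hypothetical helpers, not functions of the linked pipeline implementation.

```go
package main

import "strings"

// ResourceMetadata is a simplified stand-in for per-resource metadata.
type ResourceMetadata struct {
	Examples     []string          // raw example manifests
	ArgumentDocs map[string]string // flat parameter name -> doc string
}

// docComment returns a single-line Go comment for the given parameter, or
// the empty string when no documentation is available. A missing entry is
// not treated as an error.
func docComment(md *ResourceMetadata, param string) string {
	if md == nil {
		return ""
	}
	doc := strings.TrimSpace(md.ArgumentDocs[param])
	if doc == "" {
		return ""
	}
	// Scraped doc strings start with "- " in the example document.
	return "// " + strings.TrimPrefix(doc, "- ")
}

// firstExample returns the first example manifest, or false when the
// metadata has none, in which case the caller simply generates no example.
func firstExample(md *ResourceMetadata) (string, bool) {
	if md == nil || len(md.Examples) == 0 {
		return "", false
	}
	return md.Examples[0], true
}
```

A pipeline built this way produces the same output as today for resources
without metadata, which is also what the optionality goal asks for.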

Metadata is valuable; the scrapers should capture as much metadata as possible
and store them in the common format, even for future use cases we do not yet
envision. New Terrajet pipelines can be added, or existing ones can be enhanced
to support advanced use cases. One such proposal could be to extend the CRD
generation pipeline to employ the `resource.subCategory` metadata to determine
the API group of a generated CRD (after some simple string processing). As an
alternative, another provider could use the `resource.importStatements` metadata
for exactly the same purpose. For example, `provider-jet-azure` currently
[uses][provider-jet-azure-group-config] what we call the Microsoft provider
name as a default for the API groups of generated resources. Of course,
resource-specific manual overrides are always possible via the [resource
configuration API]. For `provider-jet-azure`, most resource IDs have the
Microsoft provider name as a component, such as:
```
/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/resourcegroup1/providers/Microsoft.AnalysisServices/servers/server1
```
Because these ID strings appear in the import statements of most
`terraform-provider-azurerm` resources, using such metadata enables us to have a
consistent, repo-wide defaulting for the API group names of the generated resources.
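
Both defaulting strategies could be sketched as below. The function names, the
regular expression, and the exact normalization (lowercasing and dropping
non-alphanumeric characters) are assumptions made for illustration. The actual
rules used by `provider-jet-azure` may differ.

```go
package main

import (
	"regexp"
	"strings"
)

// providerExpr captures the part after "Microsoft." in the "providers"
// component of an Azure resource ID.
var providerExpr = regexp.MustCompile(`/providers/Microsoft\.([A-Za-z0-9]+)/`)

// groupFromImportStatement (hypothetical) derives a default API group name
// from a scraped import statement. It reports false when the statement does
// not contain a Microsoft provider name.
func groupFromImportStatement(stmt string) (string, bool) {
	m := providerExpr.FindStringSubmatch(stmt)
	if m == nil {
		return "", false
	}
	return strings.ToLower(m[1]), true
}

// groupFromSubCategory (hypothetical) derives a default API group name from
// the subCategory metadata by keeping only lowercase letters and digits.
func groupFromSubCategory(subCategory string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(subCategory) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		}
	}
	return b.String()
}
```

For `azurerm_analysis_services_server`, both functions would arrive at
`analysisservices`: one from the import statement in the example metadata, the
other from the `Analysis Services` sub-category.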

[Terrajet]: https://github.com/crossplane/terrajet
[provider-jet-aws-preview]:
https://doc.crds.dev/github.com/crossplane-contrib/[email protected]
[Terraform registry]: https://registry.terraform.io/
[provider-jet-aws]: https://github.com/crossplane-contrib/provider-jet-aws
[provider-jet-gcp]: https://github.com/crossplane-contrib/provider-jet-gcp
[provider-jet-azure]: https://github.com/crossplane-contrib/provider-jet-azure
[Terrajet issue #48]: https://github.com/crossplane/terrajet/issues/48
[aws-example-configurations]:
https://github.com/hashicorp/terraform-provider-aws/tree/main/examples
[terraform-provider-azurerm]:
https://github.com/hashicorp/terraform-provider-azurerm
[terraform-provider-aws]: https://github.com/hashicorp/terraform-provider-aws
[terraform-provider-google]:
https://github.com/hashicorp/terraform-provider-google
[terrajet-pr-173]: https://github.com/crossplane/terrajet/pull/173
[example manifest generation pipeline]:
https://github.com/ulucinar/terrajet/blob/fix-48/pkg/pipeline/example.go
[provider-jet-azure-group-config]:
https://github.com/crossplane-contrib/provider-jet-azure/blob/main/config/apigroup_config.go
[resource configuration API]:
https://github.com/crossplane/terrajet/blob/main/pkg/config/resource.go
