Add one-pager for Terraform registry scrapers and registry metadata

enhanced Terrajet codegen pipelines - Fixes crossplane/terrajet/issues/203 Signed-off-by: Alper Rifat Ulucinar <[email protected]>
ulucinar · Mar 10, 2022 · 99a70b1 · 99a70b1
1 parent 6494953
commit 99a70b1
Showing 1 changed file with 272 additions and 0 deletions.
diff --git a/design/one-pager-terrajet-metadata-extraction.md b/design/one-pager-terrajet-metadata-extraction.md
@@ -0,0 +1,272 @@
+# Metadata Extraction from Terraform Registry for Terrajet-based providers
+* Owner: Alper Rifat Uluçınar (@ulucinar)
+* Reviewers: Crossplane Maintainers
+* Status: Draft
+
+### Background
+For providers generated using [Terrajet], the number of managed resources can
+exceed [several hundreds](provider-jet-aws-preview), and especially for the big
+three Terrajet-based providers ([provider-jet-aws], [provider-jet-gcp] and
+[provider-jet-azure]), it's very inconvenient and time consuming to manually
+author example manifests for all those generated resources. The convention we
+have adopted so far is to manually add example manifests for the resources we
+explicity configure in their respective pull requests. 
+
+Another dimension we need to consider is that currently we are lacking API
+documentation for the generated resources. Although it's possible to redirect
+users of those APIs to the [Terraform registry], it's desirable to have the
+documentation generated together with the API (as comments on the associated
+`struct`s and fields), and have them published on `doc.crds.dev`.
+
+There is also a wealth of metadata that we can use to enrich the Terrajet-based
+providers and the generated resources such as category names for the Terraform
+resources. For example, a provider implementation may opt to use the category
+names to group respective APIs. Or the examples provided in the Terraform
+registry hint at reference fields that can help us in auto-generating
+cross-resource references (that appear in those HCL configurations).
+
+While working on generating example manifests for the big three Terrajet-based
+providers `provider-jet-aws`, `provider-jet-gcp` and `provider-jet-azure` in the
+context of the corresponding [Terrajet issue #48], we have seen utility in
+extracting such metadata from the Terraform registry and use it to generate
+example manifests and documentation. In this document, we would like to propose:
+- A metadata format that we can optionally use in the Terrajet-based provider
+repositories to generate example manifests, documentation, etc., 
+- A concept of metadata extractors from Terraform registry and potentially from
+  other sources for Terrajet-based providers, 
+- A new Terrajet codegen pipeline to generate example manifests, which can
+optionally be invoked in Terrajet-based providers during code generation,
+accepting the scraped metadata from the Terraform registry. 
+- Extension of existing Terrajet codegen pipelines to also generate
+  documentation on `struct`s and fields.
+
+### Goals
+We would like to achieve the following goals with this proposal:
+- The proposed new pipeline(s) or extensions of the existing code generation
+  pipelines must be optional. If, for example, an example generation pipeline is
+  not configured in a Terrajet-based provider repo, or if the already existing
+  code generation pipeline is not configured to also generate documentation,
+  then the behavior of the configured pipelines should not change. Thus,
+  configuration of the new pipelines or enhancement of existing ones with
+  registry-scraped metada should be optional.
+- Like existing Terrajet pipelines, newly added registry metadata based
+  pipelines should be stable, i.e., running them on the same metadata must
+  always produce the same output. Simiarly, any extension of the existing
+  pipelines with registry metadata must preserve their stability.
+- We would like to have means of correcting/adjusting scraped metadata before
+  it's input to the codegen pipelines. This would allow us to make manual
+  corrections/enhancements on the output of a scraper, or even manually craft
+  complete or semi-complete registry metadata documents, if for example the
+  provider is small (in the number of resources it supports), and an automatic
+  scraper is not immediately available. This will also allow us, if needed, to
+  have different scrapers that produce output in the same metadata format. For
+  instance, we may have a relatively complex scraper for extracting metadata
+  from the Terraform registry, and another relatively simple one that just adds
+  example HCL configurations by reading them from their respective
+  [files][aws-example-configurations]. This will allow different scraper
+  implementations to be able to fetch metadata from different sources but the
+  Terrajet pipelines will always be working on a well defined format regardless
+  of how those metadata are scraped.
+- We would like to have the scrapers run as needed, produce their output in the
+  common metadata format, and to have the metadata documents added to their
+  respective repositories. However, we can then have the corresponding pipelines
+  run each time with a `make generate`, just like the existing codegen pipelines
+  we have. This would allow us to separate the lifecycles of metadata-scraping
+  and code generation.
+
+### Metadata Format
+The proposed syntax for scraped metadata documents is YAML as we would also like
+the metadata to be human readable, searchable and maintainable, if needed. A
+concrete example of a scraped registry metadata document for a resource named
+`azurerm_analysis_services_server` of the native Terraform provider
+[terraform-provider-azurerm] could be as follows:
+
+```yaml
+# Terraform native provider name
+name: hashicorp/terraform-provider-azurerm
+# map from Terraform native resource names to scraped resource metadata
+resources:
+    # a Terraform native resource name defined in the provider
+    azurerm_analysis_services_server:
+        # sub-category metadata for the resource extracted from Terraform registry, if available. 
+        # Candidate to be used as API group names in the generated Terrajet provider, if desired.
+        subCategory: Analysis Services
+        # description for the resource extracted from Terraform registry, if available. 
+        # Candidate to be used as the CRD type documentation
+        description: Manages an Analysis Services Server.
+        # title for the resource as it appears in the registry.
+        titleName: azurerm_analysis_services_server
+        # Array of example HCL configurations available for the Terraform resource. 
+        # Terraform registry contains examples but there can be other sources as well.
+        examples:
+            # example configuration in HCL syntax
+            - manifest: |-
+                {
+                  "admin_users": [
+                    "[email protected]"
+                  ],
+                  "enable_power_bi_service": true,
+                  "ipv4_firewall_rule": [
+                    {
+                      "name": "myRule1",
+                      "range_end": "210.117.252.255",
+                      "range_start": "210.117.252.0"
+                    }
+                  ],
+                  "location": "northeurope",
+                  "name": "analysisservicesserver",
+                  "resource_group_name": "${azurerm_resource_group.rg.name}",
+                  "sku": "S0",
+                  "tags": {
+                    "abc": 123
+                  }
+                }
+              # reference parameters extracted from Terraform registry examples
+              # map from referer parameter names to referee <target resource type>.<target field>
+              references:
+                # for example, "azurerm_analysis_services_server" has a parameter 
+                # named "resource_group_name" that refers to a "azurerm_resource_group"'s 
+                # "name" parameter
+                # Candidate for auto-generating cross-resource references
+                resource_group_name: azurerm_resource_group.name
+        # scraped Terraform registry docs for the parameters and attributes of the resource
+        argumentDocs:
+            # parameters with non-block values map directly to doc strings
+            admin_users: '- (Optional) List of email addresses of admin users.'
+            backup_blob_container_uri: '- (Optional) URI and SAS token for a blob container to store backups.'
+            enable_power_bi_service: '- (Optional) Indicates if the Power BI service is allowed to access or not.'
+            # exported attributes appear under the "exportedAttributes" map (as a block)
+            exportedAttributes:
+                id: '- The ID of the Analysis Services Server.'
+                server_full_name: '- The full name of the Analysis Services Server.'            
+            # parameters with block values are maps
+            ipv4_firewall_rule:
+                name: '- (Required) Specifies the name of the firewall rule.'
+                # if the block-valued parameter has itself a description, it appears under "nodeText"
+                # We assume "nodeText" is not a valid parameter/attribute name
+                nodeText: '- (Optional) One or more ipv4_firewall_rule block(s) as defined below.'
+                range_end: '- (Required) End of the firewall rule range as IPv4 address.'
+                range_start: '- (Required) Start of the firewall rule range as IPv4 address.'
+            location: '- (Required) The Azure location where the Analysis Services Server exists. Changing this forces a new resource to be created.'
+            name: '- (Required) The name of the Analysis Services Server. Changing this forces a new resource to be created.'
+            querypool_connection_mode: '- (Optional) Controls how the read-write server is used in the query pool. If this value is set to All then read-write servers are also used for queries. Otherwise with ReadOnly these servers do not participate in query operations.'
+            resource_group_name: '- (Required) The name of the Resource Group in which the Analysis Services Server should be exist. Changing this forces a new resource to be created.'
+            sku: '- (Required) SKU for the Analysis Services Server. Possible values are: D1, B1, B2, S0, S1, S2, S4, S8, S9, S8v2 and S9v2.'
+            timeouts:
+                create: '- (Defaults to 30 minutes) Used when creating the Analysis Services Server.'
+                delete: '- (Defaults to 30 minutes) Used when deleting the Analysis Services Server.'
+                read: '- (Defaults to 5 minutes) Used when retrieving the Analysis Services Server.'
+                update: '- (Defaults to 30 minutes) Used when updating the Analysis Services Server.'
+        # import statement scraped from the Terraform registry, if available
+        # Can be used for advanced purposes, such as constructing resource config "ExternalName.GetIDFn" functions, etc.
+        importStatements:
+            - terraform import azurerm_analysis_services_server.server /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/resourcegroup1/providers/Microsoft.AnalysisServices/servers/server1
+```
+
+Another alternative could be to have qualified names under the `argumentDocs`
+with a flat hierarchy, e.g., instead of a nested `ipv4_firewall_rule` block
+represented as a map, we could have its block parameters qualified with the
+configuration block name (`ipv4_firewall_rule.range_start`,
+`ipv4_firewall_rule.range_end`, etc.) Then `argumentDocs` would become a simple
+`map[string]string`.
+
+Another alternative could be to have per-resource YAML metadata files, i.e.,
+instead of the `resources` map in a single file, we could have each of its keys
+(and associated metadata) stored in a resource specific YAML-formatted file.
+These resource specific files could each be named as `<Terraform resource
+type>.yaml`, e.g., `azurerm_analysis_services_server.yaml`. 
+
+### Metadata scrapers
+Although not validated on all of available Terraform providers, at least, the
+big three Terraform providers ([terraform-provider-aws],
+[terraform-provider-azurerm] and [terraform-provider-google]) all have Terraform
+registry content in their respective repositories and use markdown documents
+with a common structure. Our assumption is that Terraform registry website is
+also generated using these markdown files:
+- For `terraform-provider-aws`:
+  https://github.com/hashicorp/terraform-provider-aws/tree/main/website/docs/r
+- For `terraform-provider-azurerm`:
+  https://github.com/hashicorp/terraform-provider-azurerm/tree/main/website/docs/r
+- For `terraform-provider-google`:
+  https://github.com/hashicorp/terraform-provider-google/tree/main/website/docs/r
+
+Thus a common metadata scraper implementation can extract metadata from these
+well-formatted per-resource markdown documents. Any spotted errors can then
+potentially be corrected manually in the scraped YAML metadata document.
+Scrapers can optionally be chained: If desired, another scraper can append
+example HCL configurations read from a different source (such as the `examples`
+folder found in some of the Terraform native provider repositories as discussed
+above).
+
+For most Terraform native providers, we anticipate that Terraform registry
+scrapers will **not** run on HTTP, as the resource markdown files are part of
+their corresponding provider repositories. They can just read those markdowns
+from a pointed directory in the local filesystem, which is specified as a
+command-line argument, for instance. 
+
+As already indicated, if it turns out that a common registry scraper
+implementation is not suitable for a specific Terraform native provider, then a
+new scraper can be written as long as it produces metadata output in the
+expected metadata format by the Terrajet pipelines. Or even, if the cost of
+writing a new scraper is higher than manually authoring the metadata YAML
+documents (e.g., the number of resources in the native provider is small), we
+can just prepare the metadata YAML(s) by hand, just like the Terraform community
+manually maintains the corresponding markdown documents for the Terraform
+registry.
+
+### Terrajet Codegen Pipelines Consuming Metadata
+As [implemented][terrajet-pr-173] in the context of the [Terrajet issue #48] for
+example manifest generation for the big three providers, we could have some
+configurable codegen pipelines that consume the YAML resource metadata file(s)
+and produce example manifests, CRD documentation, etc. (as discussed above). The
+configured pipelines should not fail if the necessary metadata is missing: For
+instance, the [example manifest generation pipeline] should simply not generate
+an example manifest for a managed resource, if no sample HCL configuration is
+available for the corresponding Terraform resource in the metadata. Or, the
+metadata-enhanced CRD generation pipeline should simply skip doc comments if
+none or some are not available in the corresponding metadata document. 
+
+Metadata is valuable; the scrapers should capture as much metadata as possible
+and store them in the common format, even for future use cases we do not yet
+envision. New Terrajet pipelines can be added, or existing ones can be enhanced
+to support advanced use cases. One such proposal could be to extend the CRD
+generation pipeline to employ the `resource.subCategory` metadata to determine
+the API group of a generated CRD (after some simple string processing). Or as an
+alternative, another provider could use the `resource.importStatements` metadata
+for exactly the same purpose. For example, `provider-jet-azure` currently
+[uses][provider-jet-azure-group-config] what we call as the Microsoft provider
+name as a default for the API groups of generated resources. Of course, resource
+specific manual overrides are always possible via the [resource configuration
+API] and for `provider-jet-azure`, most resource IDs have the Microsoft provider
+name as a component such as:
+```
+/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/resourcegroup1/providers/Microsoft.AnalysisServices/servers/server1
+```
+Because these ID strings appear in the import statements of most
+`terraform-provider-azurerm` resources, using such metadata enables us to have a
+consistent, repo-wide defaulting for the API group names of the generated resources. 
+
+[Terrajet]: https://github.com/crossplane/terrajet
+[provider-jet-aws-preview]:
+    https://doc.crds.dev/github.com/crossplane-contrib/[email protected]
+[Terraform registry]: https://registry.terraform.io/
+[provider-jet-aws]: https://github.com/crossplane-contrib/provider-jet-aws
+[provider-jet-gcp]: https://github.com/crossplane-contrib/provider-jet-gcp
+[provider-jet-azure]: https://github.com/crossplane-contrib/provider-jet-azure
+[Terrajet issue #48]: https://github.com/crossplane/terrajet/issues/48
+[aws-example-configurations]:
+    https://github.com/hashicorp/terraform-provider-aws/tree/main/examples
+[terraform-provider-azurerm]:
+    https://github.com/hashicorp/terraform-provider-azurerm
+[terraform-provider-azurerm]:
+    https://github.com/hashicorp/terraform-provider-azurerm
+[terraform-provider-aws]: https://github.com/hashicorp/terraform-provider-aws
+[terraform-provider-google]:
+    https://github.com/hashicorp/terraform-provider-google
+[terrajet-pr-173]: https://github.com/crossplane/terrajet/pull/173
+[example manifest generation pipeline]:
+    https://github.com/ulucinar/terrajet/blob/fix-48/pkg/pipeline/example.go
+[provider-jet-azure-group-config]:
+    https://github.com/crossplane-contrib/provider-jet-azure/blob/main/config/apigroup_config.go
+[resource configuration API]:
+    https://github.com/crossplane/terrajet/blob/main/pkg/config/resource.go