Make the content: block of any element in the landing page tree searchable #335

KludgeKML · 2024-10-22T14:05:03Z

Landing pages have "blocks" rather than a body, and by convention we want anything in a block that has the key content: to be put into the search index. Because blocks can be arbitrarily nested, we can't use a simple JSONPath matcher.

Some blocks (govspeak ones) may be structured to contain multiple content types (eg if they were marked up as content-type: text/govspeak, publishing-api will have automatically created a rendered text/html and will present both of these to the search api in the message_queue message. This means that we can't easily use a JSONPath, so instead when we have a detals/blocks section we recurse into it. For each element as we travel down the path we either find that it's a content block with structure that includes a text/html content type (in which case we add its content to the search index), a scalar value (in which case we add it to the search index), or another type of array (in which case we recurse into that)

https://trello.com/c/CK919Zwq/81-make-block-content-findable-in-site-search

richardTowers · 2024-10-23T09:58:56Z

app/models/concerns/publishing_api/content.rb

+    def values_from_blocks(document_hash)
+      matches = JsonPath.new("$.details.blocks..content", use_symbols: true).on(document_hash)
+      return [] unless matches.any?
+
+      ignore_set = []
+      values = []
+      matches.each do |match|
+        case match
+        in Array
+          match.each { |m| ignore_set << m[:content] if m[:content].present? }
+          values << BodyContent.new(match).html_content
+        else
+          if ignore_set.index(match).present?
+            ignore_set.delete_at(ignore_set.index(match))
+          else
+            values << BodyContent.new(match).html_content
+          end
+        end
+      end
+
+      values


You might be able to do this more elegantly using recursion instead of JsonPath....

Maybe something like this (currently untested):

https://github.com/alphagov/search-api-v2/pull/336/files

Hmm, I think you'd want to explicitly match content_type: "text/html" in your code, otherwise you're potentially in the situation where a block with multiple options (text/html and text/govspeak) gets into the index twice. If you run the test suite, though, that should catch that situation.

This code gets run after publishing-api has done it's recursively_transform_govspeak bit I assume? So it should always have text/html (and may or may not also have text/govspeak)?

I like this @richardTowers – the problem @KludgeKML has been hitting (and me too on my brief attempts to dig into this) is the incomplete JSONPath support in the not actively maintained Ruby gem (insert classic XKCD here).

In theory, this single JSONPath expression should do the trick (find all content fields that either do not have a content_type subfield or have one and it's HTML):

$.details.blocks..content[?([email protected]_type || @.content_type == 'text/html')]

... and then we could have let the regular content extraction logic take over.

But there isn't support for negation of subexpressions in the Ruby library (there's a PR but it's not been merged in almost 2 years).

The JSONPath approach has been great for the bulk of static schema content (I still really like just having a giant list of possible paths!) but I begrudgingly accept we might not be able to make it work for blocks. If so the recursive approach you outlined seems neater than dealing with an ignore set (modulo some tidying to make it work).

This code gets run after publishing-api has done it's recursively_transform_govspeak bit I assume? So it should always have text/html (and may or may not also have text/govspeak)?

Publishing-api has done its recursive_transform_govspeak, so the text/html block will exist, but so will the original text/govspeak block. We don't want to dump both the text/html content AND the identical text/govspeak content into the search index ideally.

How would you feel about:

def values_from_blocks(item) case item in { content_type: "text/html", content: html_content } html_content in { content: String => content } unless item.key?(:content_type) content in Hash item.values.flat_map { values_from_blocks(_1) } in Array item.flat_map { values_from_blocks(_1) } else nil end end

And then calling that at the bottom of #content like so:

[ *values_from_json_paths, *values_from_parts, *values_from_blocks(document_hash.dig(:details, :blocks)), ].flatten .compact_blank .join(INDEXABLE_CONTENT_SEPARATOR) .truncate_bytes(INDEXABLE_CONTENT_MAX_BYTE_SIZE)

(passes the tests for me and is quite neat and minimal!)

The question I was about to add was that what I needed to stop all the other tests failing was a guard clause checking for details/blocks outside of the recursive function, so that solution seems reasonable for me.

@richardTowers Do you want to

make your PR reviewable, then we can

merge it into this one, then I can

update the description for this PR, and then

Christian gives it the thumbs up and we're good to go?

Sure, although I've got a bit of a run of meetings so you might have to wait a few hours. I'm totally fine for you to just fold the changes into this one and close my draft PR if that's easier.

Done. @csutter do you want to give it one last eye over? (the active code is now your modified version of Richard's as above, the test and fixture are mine, unchanged from this morning)

leenagupte

I've got a couple inline comments about the logic, but on the whole it looks good!

leenagupte · 2024-10-23T10:19:24Z

app/models/concerns/publishing_api/content.rb

        .flatten
        .compact_blank
        .join(INDEXABLE_CONTENT_SEPARATOR)
        .truncate_bytes(INDEXABLE_CONTENT_MAX_BYTE_SIZE)
    end
+
+    def values_from_blocks(document_hash)


I don't think document_hash needs to be passed into this method. The concern should already have access to it.

leenagupte · 2024-10-23T10:20:21Z

app/models/concerns/publishing_api/content.rb

        .flatten
        .compact_blank
        .join(INDEXABLE_CONTENT_SEPARATOR)
        .truncate_bytes(INDEXABLE_CONTENT_MAX_BYTE_SIZE)
    end
+
+    def values_from_blocks(document_hash)
+      matches = JsonPath.new("$.details.blocks..content", use_symbols: true).on(document_hash)


This is nice!

leenagupte · 2024-10-23T10:24:40Z

app/models/concerns/publishing_api/content.rb

+      values = []
+      matches.each do |match|
+        case match
+        in Array


Ok, so this is checking the nested content label?

leenagupte · 2024-10-23T10:27:55Z

app/models/concerns/publishing_api/content.rb

+          match.each { |m| ignore_set << m[:content] if m[:content].present? }
+          values << BodyContent.new(match).html_content


Is an ignore_set needed? Can the content be transformed, i.e. have BodyContent.new(match).html_content run on it, and then check if already exists in values before adding it?

I think doing that might help with a recursive method call. So values_from_blocks sets the "matches" and then passes that to a new method that can also be called with a "content" array as well.

leenagupte · 2024-10-23T10:31:21Z

spec/integration/document_synchronization_spec.rb

+  describe "for a 'landing_page' message" do
+    let(:payload) { json_fixture_as_hash("message_queue/landing_page_message.json") }
+
+    it "is added to Discovery Engine through the Put service" do


This test is great! ❤️

- Landing pages have "blocks" rather than a body, and by convention we want anything in a block that has the key `content:` to be put into the search index. - However, some blocks (govspeak ones) may be structured to contain multiple content types (eg if they were marked up as content-type: text/govspeak, publishing-api will have automatically created a rendered text/html and will present both of these to the search api in the message_queue message. This means that we can't easily use a JSONPath, so instead when we have a detals/blocks section we recurse into it. For each element as we travel down the path we either find that it's a content block with structure that includes a text/html content type (in which case we add its content to the search index), a scalar value (in which case we add it to the search index), or another type of array (in which case we recurse into that)

csutter

Nice one! Welcome to #TeamSearchApiV2 🤩

KludgeKML force-pushed the handle-landing-page-content branch from 38d99a7 to 3f4d8c8 Compare October 23, 2024 08:59

KludgeKML changed the title ~~WIP~~ Make the content: block of any element in the landing page tree searchable Oct 23, 2024

KludgeKML marked this pull request as ready for review October 23, 2024 09:01

KludgeKML requested review from csutter, richardTowers and leenagupte October 23, 2024 09:01

richardTowers reviewed Oct 23, 2024

View reviewed changes

leenagupte reviewed Oct 23, 2024

View reviewed changes

KludgeKML force-pushed the handle-landing-page-content branch from 3f4d8c8 to 4c9c185 Compare October 23, 2024 12:24

csutter approved these changes Oct 23, 2024

View reviewed changes

richardTowers approved these changes Oct 23, 2024

View reviewed changes

KludgeKML merged commit 047264c into main Oct 23, 2024
8 checks passed

KludgeKML deleted the handle-landing-page-content branch October 23, 2024 14:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the content: block of any element in the landing page tree searchable #335

Make the content: block of any element in the landing page tree searchable #335

KludgeKML commented Oct 22, 2024 •

edited

Loading

richardTowers Oct 23, 2024

KludgeKML Oct 23, 2024

richardTowers Oct 23, 2024

csutter Oct 23, 2024

KludgeKML Oct 23, 2024

csutter Oct 23, 2024 •

edited

Loading

KludgeKML Oct 23, 2024

KludgeKML Oct 23, 2024

richardTowers Oct 23, 2024

KludgeKML Oct 23, 2024

leenagupte left a comment

leenagupte Oct 23, 2024

leenagupte Oct 23, 2024

leenagupte Oct 23, 2024

leenagupte Oct 23, 2024

leenagupte Oct 23, 2024 •

edited

Loading

leenagupte Oct 23, 2024

csutter left a comment

		match.each { \|m\| ignore_set << m[:content] if m[:content].present? }
		values << BodyContent.new(match).html_content

Make the content: block of any element in the landing page tree searchable #335

Make the content: block of any element in the landing page tree searchable #335

Conversation

KludgeKML commented Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

csutter Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leenagupte left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leenagupte Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

csutter left a comment

Choose a reason for hiding this comment

KludgeKML commented Oct 22, 2024 •

edited

Loading

csutter Oct 23, 2024 •

edited

Loading

leenagupte Oct 23, 2024 •

edited

Loading