Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the content: block of any element in the landing page tree searchable #335

Merged
merged 1 commit into from
Oct 23, 2024

Conversation

KludgeKML
Copy link
Contributor

@KludgeKML KludgeKML commented Oct 22, 2024

Landing pages have "blocks" rather than a body, and by convention we want anything in a block that has the key content: to be put into the search index. Because blocks can be arbitrarily nested, we can't use a simple JSONPath matcher.

Some blocks (govspeak ones) may be structured to contain multiple content types (eg if they were marked up as content-type: text/govspeak, publishing-api will have automatically created a rendered text/html and will present both of these to the search api in the message_queue message. This means that we can't easily use a JSONPath, so instead when we have a detals/blocks section we recurse into it. For each element as we travel down the path we either find that it's a content block with structure that includes a text/html content type (in which case we add its content to the search index), a scalar value (in which case we add it to the search index), or another type of array (in which case we recurse into that)

https://trello.com/c/CK919Zwq/81-make-block-content-findable-in-site-search

@KludgeKML KludgeKML changed the title WIP Make the content: block of any element in the landing page tree searchable Oct 23, 2024
@KludgeKML KludgeKML marked this pull request as ready for review October 23, 2024 09:01
Comment on lines 79 to 99
def values_from_blocks(document_hash)
matches = JsonPath.new("$.details.blocks..content", use_symbols: true).on(document_hash)
return [] unless matches.any?

ignore_set = []
values = []
matches.each do |match|
case match
in Array
match.each { |m| ignore_set << m[:content] if m[:content].present? }
values << BodyContent.new(match).html_content
else
if ignore_set.index(match).present?
ignore_set.delete_at(ignore_set.index(match))
else
values << BodyContent.new(match).html_content
end
end
end

values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to do this more elegantly using recursion instead of JsonPath....

Maybe something like this (currently untested):

https://github.com/alphagov/search-api-v2/pull/336/files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think you'd want to explicitly match content_type: "text/html" in your code, otherwise you're potentially in the situation where a block with multiple options (text/html and text/govspeak) gets into the index twice. If you run the test suite, though, that should catch that situation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code gets run after publishing-api has done it's recursively_transform_govspeak bit I assume? So it should always have text/html (and may or may not also have text/govspeak)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this @richardTowers – the problem @KludgeKML has been hitting (and me too on my brief attempts to dig into this) is the incomplete JSONPath support in the not actively maintained Ruby gem (insert classic XKCD here).

In theory, this single JSONPath expression should do the trick (find all content fields that either do not have a content_type subfield or have one and it's HTML):

$.details.blocks..content[?([email protected]_type || @.content_type == 'text/html')]

... and then we could have let the regular content extraction logic take over.

But there isn't support for negation of subexpressions in the Ruby library (there's a PR but it's not been merged in almost 2 years).

The JSONPath approach has been great for the bulk of static schema content (I still really like just having a giant list of possible paths!) but I begrudgingly accept we might not be able to make it work for blocks. If so the recursive approach you outlined seems neater than dealing with an ignore set (modulo some tidying to make it work).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code gets run after publishing-api has done it's recursively_transform_govspeak bit I assume? So it should always have text/html (and may or may not also have text/govspeak)?

Publishing-api has done its recursive_transform_govspeak, so the text/html block will exist, but so will the original text/govspeak block. We don't want to dump both the text/html content AND the identical text/govspeak content into the search index ideally.

Copy link
Contributor

@csutter csutter Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you feel about:

def values_from_blocks(item)
  case item
  in { content_type: "text/html", content: html_content }
    html_content
  in { content: String => content } unless item.key?(:content_type)
    content
  in Hash
    item.values.flat_map { values_from_blocks(_1) }
  in Array
    item.flat_map { values_from_blocks(_1) }
  else
    nil
  end
end

And then calling that at the bottom of #content like so:

[
  *values_from_json_paths,
  *values_from_parts,
  *values_from_blocks(document_hash.dig(:details, :blocks)),
].flatten
  .compact_blank
  .join(INDEXABLE_CONTENT_SEPARATOR)
  .truncate_bytes(INDEXABLE_CONTENT_MAX_BYTE_SIZE)

(passes the tests for me and is quite neat and minimal!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question I was about to add was that what I needed to stop all the other tests failing was a guard clause checking for details/blocks outside of the recursive function, so that solution seems reasonable for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardTowers Do you want to

  1. make your PR reviewable, then we can
  2. merge it into this one, then I can
  3. update the description for this PR, and then
  4. Christian gives it the thumbs up and we're good to go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, although I've got a bit of a run of meetings so you might have to wait a few hours. I'm totally fine for you to just fold the changes into this one and close my draft PR if that's easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. @csutter do you want to give it one last eye over? (the active code is now your modified version of Richard's as above, the test and fixture are mine, unchanged from this morning)

Copy link
Contributor

@leenagupte leenagupte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got a couple inline comments about the logic, but on the whole it looks good!

.flatten
.compact_blank
.join(INDEXABLE_CONTENT_SEPARATOR)
.truncate_bytes(INDEXABLE_CONTENT_MAX_BYTE_SIZE)
end

def values_from_blocks(document_hash)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think document_hash needs to be passed into this method. The concern should already have access to it.

.flatten
.compact_blank
.join(INDEXABLE_CONTENT_SEPARATOR)
.truncate_bytes(INDEXABLE_CONTENT_MAX_BYTE_SIZE)
end

def values_from_blocks(document_hash)
matches = JsonPath.new("$.details.blocks..content", use_symbols: true).on(document_hash)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice!

values = []
matches.each do |match|
case match
in Array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so this is checking the nested content label?

Comment on lines 88 to 89
match.each { |m| ignore_set << m[:content] if m[:content].present? }
values << BodyContent.new(match).html_content
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is an ignore_set needed? Can the content be transformed, i.e. have BodyContent.new(match).html_content run on it, and then check if already exists in values before adding it?

Copy link
Contributor

@leenagupte leenagupte Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doing that might help with a recursive method call. So values_from_blocks sets the "matches" and then passes that to a new method that can also be called with a "content" array as well.

describe "for a 'landing_page' message" do
let(:payload) { json_fixture_as_hash("message_queue/landing_page_message.json") }

it "is added to Discovery Engine through the Put service" do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is great! ❤️

- Landing pages have "blocks" rather than a body, and
  by convention we want anything in a block that has the
  key `content:` to be put into the search index.
- However, some blocks (govspeak ones) may be structured
  to contain multiple content types (eg if they were marked
  up as content-type: text/govspeak, publishing-api will
  have automatically created a rendered text/html and will
  present both of these to the search api in the
  message_queue message. This means that we can't easily
  use a JSONPath, so instead when we have a detals/blocks
  section we recurse into it. For each element as we
  travel down the path we either find that it's a
  content block with structure that includes a text/html
  content type (in which case we add its content to the
  search index), a scalar value (in which case we add it
  to the search index), or another type of array (in which
  case we recurse into that)
Copy link
Contributor

@csutter csutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one! Welcome to #TeamSearchApiV2 🤩

@KludgeKML KludgeKML merged commit 047264c into main Oct 23, 2024
8 checks passed
@KludgeKML KludgeKML deleted the handle-landing-page-content branch October 23, 2024 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants