-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make the content: block of any element in the landing page tree searchable #335
Conversation
38d99a7
to
3f4d8c8
Compare
def values_from_blocks(document_hash) | ||
matches = JsonPath.new("$.details.blocks..content", use_symbols: true).on(document_hash) | ||
return [] unless matches.any? | ||
|
||
ignore_set = [] | ||
values = [] | ||
matches.each do |match| | ||
case match | ||
in Array | ||
match.each { |m| ignore_set << m[:content] if m[:content].present? } | ||
values << BodyContent.new(match).html_content | ||
else | ||
if ignore_set.index(match).present? | ||
ignore_set.delete_at(ignore_set.index(match)) | ||
else | ||
values << BodyContent.new(match).html_content | ||
end | ||
end | ||
end | ||
|
||
values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to do this more elegantly using recursion instead of JsonPath....
Maybe something like this (currently untested):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think you'd want to explicitly match content_type: "text/html" in your code, otherwise you're potentially in the situation where a block with multiple options (text/html and text/govspeak) gets into the index twice. If you run the test suite, though, that should catch that situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code gets run after publishing-api has done it's recursively_transform_govspeak
bit I assume? So it should always have text/html
(and may or may not also have text/govspeak
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this @richardTowers – the problem @KludgeKML has been hitting (and me too on my brief attempts to dig into this) is the incomplete JSONPath support in the not actively maintained Ruby gem (insert classic XKCD here).
In theory, this single JSONPath expression should do the trick (find all content
fields that either do not have a content_type
subfield or have one and it's HTML):
$.details.blocks..content[?([email protected]_type || @.content_type == 'text/html')]
... and then we could have let the regular content extraction logic take over.
But there isn't support for negation of subexpressions in the Ruby library (there's a PR but it's not been merged in almost 2 years).
The JSONPath approach has been great for the bulk of static schema content (I still really like just having a giant list of possible paths!) but I begrudgingly accept we might not be able to make it work for blocks. If so the recursive approach you outlined seems neater than dealing with an ignore set (modulo some tidying to make it work).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code gets run after publishing-api has done it's recursively_transform_govspeak bit I assume? So it should always have text/html (and may or may not also have text/govspeak)?
Publishing-api has done its recursive_transform_govspeak, so the text/html block will exist, but so will the original text/govspeak block. We don't want to dump both the text/html content AND the identical text/govspeak content into the search index ideally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you feel about:
def values_from_blocks(item)
case item
in { content_type: "text/html", content: html_content }
html_content
in { content: String => content } unless item.key?(:content_type)
content
in Hash
item.values.flat_map { values_from_blocks(_1) }
in Array
item.flat_map { values_from_blocks(_1) }
else
nil
end
end
And then calling that at the bottom of #content
like so:
[
*values_from_json_paths,
*values_from_parts,
*values_from_blocks(document_hash.dig(:details, :blocks)),
].flatten
.compact_blank
.join(INDEXABLE_CONTENT_SEPARATOR)
.truncate_bytes(INDEXABLE_CONTENT_MAX_BYTE_SIZE)
(passes the tests for me and is quite neat and minimal!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question I was about to add was that what I needed to stop all the other tests failing was a guard clause checking for details/blocks outside of the recursive function, so that solution seems reasonable for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardTowers Do you want to
- make your PR reviewable, then we can
- merge it into this one, then I can
- update the description for this PR, and then
- Christian gives it the thumbs up and we're good to go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, although I've got a bit of a run of meetings so you might have to wait a few hours. I'm totally fine for you to just fold the changes into this one and close my draft PR if that's easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. @csutter do you want to give it one last eye over? (the active code is now your modified version of Richard's as above, the test and fixture are mine, unchanged from this morning)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got a couple inline comments about the logic, but on the whole it looks good!
.flatten | ||
.compact_blank | ||
.join(INDEXABLE_CONTENT_SEPARATOR) | ||
.truncate_bytes(INDEXABLE_CONTENT_MAX_BYTE_SIZE) | ||
end | ||
|
||
def values_from_blocks(document_hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think document_hash
needs to be passed into this method. The concern should already have access to it.
.flatten | ||
.compact_blank | ||
.join(INDEXABLE_CONTENT_SEPARATOR) | ||
.truncate_bytes(INDEXABLE_CONTENT_MAX_BYTE_SIZE) | ||
end | ||
|
||
def values_from_blocks(document_hash) | ||
matches = JsonPath.new("$.details.blocks..content", use_symbols: true).on(document_hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice!
values = [] | ||
matches.each do |match| | ||
case match | ||
in Array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so this is checking the nested content
label?
match.each { |m| ignore_set << m[:content] if m[:content].present? } | ||
values << BodyContent.new(match).html_content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is an ignore_set
needed? Can the content be transformed, i.e. have BodyContent.new(match).html_content
run on it, and then check if already exists in values before adding it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think doing that might help with a recursive method call. So values_from_blocks
sets the "matches" and then passes that to a new method that can also be called with a "content" array as well.
describe "for a 'landing_page' message" do | ||
let(:payload) { json_fixture_as_hash("message_queue/landing_page_message.json") } | ||
|
||
it "is added to Discovery Engine through the Put service" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is great! ❤️
- Landing pages have "blocks" rather than a body, and by convention we want anything in a block that has the key `content:` to be put into the search index. - However, some blocks (govspeak ones) may be structured to contain multiple content types (eg if they were marked up as content-type: text/govspeak, publishing-api will have automatically created a rendered text/html and will present both of these to the search api in the message_queue message. This means that we can't easily use a JSONPath, so instead when we have a detals/blocks section we recurse into it. For each element as we travel down the path we either find that it's a content block with structure that includes a text/html content type (in which case we add its content to the search index), a scalar value (in which case we add it to the search index), or another type of array (in which case we recurse into that)
3f4d8c8
to
4c9c185
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one! Welcome to #TeamSearchApiV2 🤩
Landing pages have "blocks" rather than a body, and by convention we want anything in a block that has the key
content:
to be put into the search index. Because blocks can be arbitrarily nested, we can't use a simple JSONPath matcher.Some blocks (govspeak ones) may be structured to contain multiple content types (eg if they were marked up as content-type: text/govspeak, publishing-api will have automatically created a rendered text/html and will present both of these to the search api in the message_queue message. This means that we can't easily use a JSONPath, so instead when we have a detals/blocks section we recurse into it. For each element as we travel down the path we either find that it's a content block with structure that includes a text/html content type (in which case we add its content to the search index), a scalar value (in which case we add it to the search index), or another type of array (in which case we recurse into that)
https://trello.com/c/CK919Zwq/81-make-block-content-findable-in-site-search