Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search Google Analytics tracking #39

Merged

Conversation

lewisnyman
Copy link
Contributor

This PR adds event tracking on search result clicks and on search terms, there's a wait time of a second after a user has finished typing before an event is sent. This might be too low and we might need to increase this once we have real usage.

@@ -176,6 +196,16 @@
$html.removeClass('has-search-results-open');
}

function sendQueryToAnalytics() {
ga('send', {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check the query is not empty here - if the user starts typing and then deletes their input I don't think we need an event for it

if(window.ga) {
$searchResults.on('click', '.search-result__title a', function() {
var href = $(this).attr('href');
ga('send', {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to tie these events to the query that generated them?

Ideally I'd like to be able to find out which queries people are abandoning without clicking anything, because those could highlight problems with the documentation itself, or problems with the search (like the ones that came up in alphagov/registers-tech-docs#109)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this would be useful. I think your best bet here is using the event flow report. With a custom "users with search event" segment, you should be able to get a useful view.

@@ -50,12 +52,16 @@
// Search functionality on search text input
$searchInput.on('input', function (e) {
e.preventDefault();
var query = $(this).val();
query = $(this).val();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do any sanitisation on this before sending it to GA? We do scrub things that look like PII from data sent to GA in other places, eg https://github.com/alphagov/govuk_frontend_toolkit/blob/master/javascripts/govuk/analytics/analytics.js

I'm not aware of any uses where I'd expect users to type in PII but as this is a template it's hard to foresee what it will be used for.

I would suggest either making this functionality opt-in, or adding some explanation of what it does to the ga-tracking-id documentation, so that site creators are aware of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. I've borrowed that stripPII function. I'll leave the full integration for #34.

I'll talk to @jonathanglassman tomorrow about making sure all the new GA tracking events are documented.

I've also opened an issue about the date regex: alphagov/govuk_frontend_toolkit#469

@jonathanglassman jonathanglassman merged commit de0c518 into alphagov:master Aug 2, 2018
ChrisBAshton added a commit that referenced this pull request Apr 11, 2023
Pages that are indexed in search results have their entire
contents indexed, including any HTML code snippets. These HTML
snippets would appear in the search results unsanitised, so it
was possible to render arbitrary HTML or run arbitrary scripts:

> ![script being invoked](https://user-images.githubusercontent.com/5111927/230888935-0367b598-eda7-4f67-afb5-799b41684ee3.png)
> ![HTML being rendered](https://user-images.githubusercontent.com/5111927/230888939-f0056edc-6955-4f10-8aee-c93414b1cb69.png)

This is a largely theoretical security issue; to exploit it, an
attacker would need to find a way of committing malicious code
to a page indexed by a site that uses tech-docs-gem (which are
typically not editable by untrusted users). Their code would
also be limited by the relatively short length that's rendered
in the corresponding search result. Nevertheless, the XSS would
then be triggerable by visiting a pre-constructed URL
(`/search/index.html?q=some+search+term`), which users could be
tricked into clicking on through social engineering.

This commit sanitises the HTML before rendering it to the page.
It does so whilst retaining the `<mark data-markjs="true">`
behaviour that highlights the search term in the result:

> ![sanitised HTML with highlights](https://user-images.githubusercontent.com/5111927/230888944-9aaf4920-cddd-43f9-8ef5-17f15785af73.png)

I've used jQuery's `text()` function for sanitisation, as that is
the approach used elsewhere in the project ([1]). Unfortunately
this can't be done in memory, as it requires an existing DOM node
to act upon ([2]). I've therefore had to create a temporary node,
which then gets immediately removed via the `detach` method, which
returns a jQuery object we can then manipulate with `text()` and
`html()` ([3]).

I did consider using native JavaScript (using the same approach as
in Mustache [4]) to avoid this hack, but this itself may contain
bugs and would lead to having two sanitisation approaches to
maintain, so I opted against it. For future reference, the code in
this commit can be swapped out with:

```js
var entityMap = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
  '/': '&#x2F;',
  '`': '&#x60;',
  '=': '&#x3D;'
};
var sanitizedContent = String(content).replace(/[&<>"'`=\/]/g, function (s) {
  return entityMap[s];
});
```

[1]: https://github.com/alphagov/tech-docs-gem/blob/66cc7ab0a06dc2f1fe89de8cba2270fcf46f6466/lib/assets/javascripts/_modules/search.js#L202-L204
[2]: http://api.jquery.com/text/#text2
[3]: http://api.jquery.com/detach/
[4]: https://github.com/janl/mustache.js/blob/972fd2b27a036888acfcb60d6119317744fac7ee/mustache.js#L60-L75
ChrisBAshton added a commit that referenced this pull request Apr 11, 2023
Pages that are indexed in search results have their entire
contents indexed, including any HTML code snippets. These HTML
snippets would appear in the search results unsanitised, so it
was possible to render arbitrary HTML or run arbitrary scripts:

> ![script being invoked](https://user-images.githubusercontent.com/5111927/230888935-0367b598-eda7-4f67-afb5-799b41684ee3.png)
> ![HTML being rendered](https://user-images.githubusercontent.com/5111927/230888939-f0056edc-6955-4f10-8aee-c93414b1cb69.png)

This is a largely theoretical security issue; to exploit it, an
attacker would need to find a way of committing malicious code
to a page indexed by a site that uses tech-docs-gem (which are
typically not editable by untrusted users). Their code would
also be limited by the relatively short length that's rendered
in the corresponding search result. Nevertheless, the XSS would
then be triggerable by visiting a pre-constructed URL
(`/search/index.html?q=some+search+term`), which users could be
tricked into clicking on through social engineering.

This commit sanitises the HTML before rendering it to the page.
It does so whilst retaining the `<mark data-markjs="true">`
behaviour that highlights the search term in the result:

> ![sanitised HTML with highlights](https://user-images.githubusercontent.com/5111927/230888944-9aaf4920-cddd-43f9-8ef5-17f15785af73.png)

I've used jQuery's `text()` function for sanitisation, as that is
the approach used elsewhere in the project ([1]). Unfortunately
this can't be done in memory, as it requires an existing DOM node
to act upon ([2]). I've therefore had to create a temporary node,
which then gets immediately removed via the `detach` method, which
returns a jQuery object we can then manipulate with `text()` and
`html()` ([3]).

I did consider using native JavaScript (using the same approach as
in Mustache [4]) to avoid this hack, but this itself may contain
bugs and would lead to having two sanitisation approaches to
maintain, so I opted against it. For future reference, the code in
this commit can be swapped out with:

```js
var entityMap = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
  '/': '&#x2F;',
  '`': '&#x60;',
  '=': '&#x3D;'
};
var sanitizedContent = String(content).replace(/[&<>"'`=\/]/g, function (s) {
  return entityMap[s];
});
```

[1]: https://github.com/alphagov/tech-docs-gem/blob/66cc7ab0a06dc2f1fe89de8cba2270fcf46f6466/lib/assets/javascripts/_modules/search.js#L202-L204
[2]: http://api.jquery.com/text/#text2
[3]: http://api.jquery.com/detach/
[4]: https://github.com/janl/mustache.js/blob/972fd2b27a036888acfcb60d6119317744fac7ee/mustache.js#L60-L75
ChrisBAshton added a commit that referenced this pull request Apr 11, 2023
Pages that are indexed in search results have their entire
contents indexed, including any HTML code snippets. These HTML
snippets would appear in the search results unsanitised, so it
was possible to render arbitrary HTML or run arbitrary scripts:

> ![script being invoked](https://user-images.githubusercontent.com/5111927/230888935-0367b598-eda7-4f67-afb5-799b41684ee3.png)
> ![HTML being rendered](https://user-images.githubusercontent.com/5111927/230888939-f0056edc-6955-4f10-8aee-c93414b1cb69.png)

This is a largely theoretical security issue; to exploit it, an
attacker would need to find a way of committing malicious code
to a page indexed by a site that uses tech-docs-gem (which are
typically not editable by untrusted users). Their code would
also be limited by the relatively short length that's rendered
in the corresponding search result. Nevertheless, the XSS would
then be triggerable by visiting a pre-constructed URL
(`/search/index.html?q=some+search+term`), which users could be
tricked into clicking on through social engineering.

This commit sanitises the HTML before rendering it to the page.
It does so whilst retaining the `<mark data-markjs="true">`
behaviour that highlights the search term in the result:

> ![sanitised HTML with highlights](https://user-images.githubusercontent.com/5111927/230888944-9aaf4920-cddd-43f9-8ef5-17f15785af73.png)

I've used jQuery's `text()` function for sanitisation, as that is
the approach used elsewhere in the project ([1]).

I did consider using native JavaScript (using the same approach as
in Mustache [2]) to avoid the jQuery dependency, but this itself may
contain bugs and would lead to having two sanitisation approaches to
maintain, so I opted against it. For future reference, the code in
this commit can be swapped out with:

```js
var entityMap = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
  '/': '&#x2F;',
  '`': '&#x60;',
  '=': '&#x3D;'
};
var sanitizedContent = String(content).replace(/[&<>"'`=\/]/g, function (s) {
  return entityMap[s];
});
```

[1]: https://github.com/alphagov/tech-docs-gem/blob/66cc7ab0a06dc2f1fe89de8cba2270fcf46f6466/lib/assets/javascripts/_modules/search.js#L202-L204
[2]: https://github.com/janl/mustache.js/blob/972fd2b27a036888acfcb60d6119317744fac7ee/mustache.js#L60-L75
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants