[FEATURE] Collect user telemetry to support QE/EC metrics #125

Open
thatplguy opened this issue Oct 18, 2024 · 7 comments

@thatplguy
Collaborator

Summary

Instrument CN, the CN error state output, and/or the VSCode plugin to support QE/EC quantitative metrics.

Feature

Identify where/how to modify our tools (or build new tools if necessary, e.g. an HTTP server) to collect the metrics identified in Tables 3-2, 3-3, and 3-4 in this doc:
https://docs.google.com/document/d/17n3E-tPR5cUZr_YSGspJlL3DrP7bGlpkglyrBWSgQJ4/edit

Also identify metrics that are difficult/impossible to measure. We can push back against collecting metrics that are too difficult to collect.

Acceptance Criteria

Running VERSE tools produces a file on the local file system containing user telemetry data.

Do

  • Short design doc explaining what changes need to be made to support these metrics
  • Implement metrics collection
  • Test metrics collection
  • [stretch] Add tests to CI
@thatplguy added the Galois and hci labels Oct 18, 2024
@thatplguy added this to the End of P1 milestone Oct 18, 2024
@samcowger
Contributor

For reference, it seems there is a more up-to-date version of these requirements here. I'll probably use that to inform a plan for telemetry collection.

@samcowger
Contributor

Here is a sketch of what kind of telemetry I think we want to collect, and how we might collect it:

What to Collect

We will solicit information on each user and their background, including:

  • A name, and/or a private unique identifier, depending on privacy need
  • Area(s) of expertise, chosen from a set of reasonably granular options that are tailored to the user base under evaluation
  • Level of expertise per area - something like a numeric or Likert scale should suffice

We will record timestamped language server lifecycle events, including:

  • Startup
  • Configuration/reconfiguration of preferences
  • Shutdown/disconnect

We will record timestamped CN interactions, including:

  • What CN functionality is invoked
    • e.g. well-formedness, verification, test generation, test execution
  • Where it's invoked
    • e.g. the file in which it's invoked, and if applicable, the name of the function in that file
  • How it's invoked
    • e.g. via a menu item, via a code lens, on document save, at command line
  • The outcome of the interaction
    • e.g. success, partial success/failure (k out of n tests passed), failure due to user error, failure due to tool error

We will generate and record an identifier for the session in which a user works with our tools. The logic for determining the boundaries of a session seems orthogonal to the practice of assigning an identifier to a session. I'll assume, without loss of generality, that a session begins when a VSCode editor is opened or when a user first executes CN at the command line, and ends when the VSCode window is closed or after some set cutoff (say, at midnight).

We will record, or be able to derive after the fact, the particular task associated with a given action. The logic for determining the scope of a task will inform whether we record it or derive it after the fact. For instance, if a task is "verify function foo in file bar.c," we could derive what actions belong to that task after the fact by finding the first interaction and the last (successful) verification of foo, and declaring that all interactions within those bounds belong to that task.
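For concreteness, here's a rough OCaml sketch of that post-hoc derivation. The event shape and the names here are hypothetical, not an existing CN API:

```ocaml
(* Hypothetical sketch: derive which logged events belong to the task
   "verify function [fn] in file bar.c", assuming each event carries a
   timestamp, an optional target function, and an optional outcome. *)
type outcome = Succeeded | Failed

type event = {
  timestamp : float;        (* seconds since epoch *)
  target : string option;   (* e.g. Some "foo" *)
  outcome : outcome option; (* None for non-verification events *)
}

(* All events between the first interaction with [fn] and the last
   successful verification of [fn], inclusive. *)
let events_for_task ~fn (events : event list) : event list =
  let touches e = e.target = Some fn in
  let is_success e = touches e && e.outcome = Some Succeeded in
  match List.find_opt touches events with
  | None -> []
  | Some first ->
    let last_success =
      List.fold_left
        (fun acc e -> if is_success e then Some e else acc)
        None events
    in
    (match last_success with
     | None -> []
     | Some last ->
       List.filter
         (fun e -> e.timestamp >= first.timestamp
                && e.timestamp <= last.timestamp)
         events)
```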

This telemetry should be sufficient to meet all "Usability" metrics, as defined in the document I linked above, except for those metrics listed below. The "FM - Tooling" metrics all seem to be inherent properties of CN itself, rather than user interactions with it, so I do not intend to address them with the facilities I describe here.

How to Collect

Telemetry collection can be implemented as relatively simple structured logging. We'll implement it as a standalone OCaml package, which both CN itself and the CN language server will use.

If on-disk storage proves flexible enough for our purposes, the tool will maintain its telemetry state on disk, including the user's profile information, the events it has logged, and the current session. This will support interleaving events that come from VSCode, the command line, and any other sources. If it does not - for instance, if exporting data after the fact proves difficult, or if multiple distinct machines need to target the same session or task - an alternative would be to stand up a simple web server that collects and stores telemetry. To keep the recording approach flexible, the implementation will take care to maintain an abstraction over serialization behavior.

It would be beneficial to record events and other data in a human-readable format like JSON - it's simple to implement, it doesn't limit the universe of potential consumers of the data, and it seems valuable to offer the user a chance to view (and possibly edit) the information these tools collect. If it proves necessary, we could probably choose a more space-efficient representation in addition to, or instead of, JSON.
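As an illustration, here's a minimal sketch of serializing one event to JSON with the yojson library (and Unix.gettimeofday for timestamps). The field names are placeholders, not a settled schema:

```ocaml
(* Sketch only: serialize a telemetry event to JSON with yojson. *)
let event_to_json ~timestamp ~functionality ~file ~invoked_via ~outcome :
    Yojson.Safe.t =
  `Assoc
    [ ("timestamp", `Float timestamp);
      ("functionality", `String functionality); (* e.g. "verification" *)
      ("file", `String file);
      ("invoked_via", `String invoked_via);     (* e.g. "code_lens" *)
      ("outcome", `String outcome);             (* e.g. "success" *)
    ]

let () =
  event_to_json ~timestamp:(Unix.gettimeofday ())
    ~functionality:"verification" ~file:"bar.c"
    ~invoked_via:"code_lens" ~outcome:"success"
  |> Yojson.Safe.to_string
  |> print_endline
```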

The tool will offer facilities for:

  • Constructing and recording individual telemetry events. The schema of these events will be kept maximally flexible to support the potential future addition of event types.
  • Recording and possibly updating persistent user profile information. See above for the sorts of information we aim to collect. The tool will treat this data as largely immutable, in contrast to event telemetry.
  • Packaging a set of telemetry events for easy export.
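
A hypothetical interface for these facilities might look something like the following; the names and signatures are illustrative only, not a committed design:

```ocaml
(* Hypothetical module interface for the telemetry package. *)
module type TELEMETRY = sig
  type profile = {
    user_id : string;                (* name or opaque identifier *)
    expertise : (string * int) list; (* area, level on a small scale *)
  }

  type event  (* opaque; constructed via [event] below *)

  (* Construct an event; [kind] and [data] are kept open-ended so new
     event types can be added without changing the interface. *)
  val event :
    kind:string -> ?data:(string * string) list -> unit -> event

  (* Append an event to the current session's log. *)
  val record : event -> unit

  (* Record or update the (mostly immutable) user profile. *)
  val set_profile : profile -> unit

  (* Bundle the current session's events into a single exportable file
     and return its path. *)
  val export : unit -> string
end
```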

Recording telemetry based on user interactions with VSCode is straightforward. The language server protocol already defines messages for most events we'd find meaningful, so we can include telemetry-recording functionality in the existing message-handling logic. That protocol is also extensible, so we can define and send custom messages for any other events we want to record.
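Sketching this generically, so as not to assume any particular LSP library: a wrapper that records a telemetry event around each request handler, using the hypothetical Telemetry interface above:

```ocaml
(* Sketch: wrap an existing request handler so that every invocation
   records a telemetry event. [Telemetry] is assumed to implement the
   hypothetical interface sketched earlier. *)
let with_telemetry ~kind (handler : 'req -> 'resp) : 'req -> 'resp =
 fun request ->
  let started = Unix.gettimeofday () in
  let response = handler request in
  Telemetry.record
    (Telemetry.event ~kind
       ~data:[ ("duration_s",
                string_of_float (Unix.gettimeofday () -. started)) ]
       ());
  response
```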

Recording telemetry from CN command-line interactions ought to be even simpler, as CN internals have access to far more information than CN clients.

Recording telemetry in a "VCA" environment ought to be simple enough. The possible pain points are determining whether a given VCA session ought to be the same as a given user-machine session, and likewise for tasks. This could be made easier by switching to a server-based collection approach, rather than on-disk serialization, as described above.

Other Metrics

Some metrics remain unmeasurable, or less measurable, with this telemetry approach.

Frequency of VDE generated tests or CN annotations in committed code base

This is not easily measurable with real-time telemetry; it is better suited to a standalone tool that analyzes an entire codebase snapshot and calculates the proportions of interest. CN specs are easily recognizable to a simple parser, so calculating their frequency amounts to simple counting. Generated tests may be less distinguishable from preexisting or user-defined tests, but they can certainly be made distinguishable if necessary.
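As a rough sketch of that standalone analysis, assuming CN specs appear in /*@ ... @*/ magic comments:

```ocaml
(* Sketch: count CN specification comments in one file of a codebase
   snapshot, assuming specs are written in /*@ ... @*/ magic comments. *)
let count_spec_comments file =
  let ic = open_in file in
  let contents = really_input_string ic (in_channel_length ic) in
  close_in ic;
  let marker = "/*@" in
  (* Scan for '/' characters and check whether each starts the marker. *)
  let rec go from acc =
    match String.index_from_opt contents from '/' with
    | None -> acc
    | Some i ->
      if i + String.length marker <= String.length contents
         && String.sub contents i (String.length marker) = marker
      then go (i + String.length marker) (acc + 1)
      else go (i + 1) acc
  in
  go 0 0
```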

Assurance created per time spent in VDE by [USER]
...
Assurance created per time spent in VDE by [EXPERTISE]

This relies on a coherent quantification of assurance, which is a tenuous notion. Assuming one exists, though, measuring it should be simple. A given task will admit a "maximum" amount of assurance, and a user will achieve some amount between zero and that maximum. We can determine the maximum as a static property of the task, and we should already be collecting enough telemetry to determine a user's achieved assurance, either in real time or via a post-hoc analysis.
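Under that assumption, the metric itself is just a ratio; a minimal sketch:

```ocaml
(* Sketch: assurance created per hour spent, given a task's maximum
   achievable assurance (a static property of the task), the assurance a
   user actually achieved, and the time they spent. *)
let assurance_per_hour ~achieved ~maximum ~seconds_spent =
  if maximum <= 0. || seconds_spent <= 0. then None
  else Some ((achieved /. maximum) /. (seconds_spent /. 3600.))
```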

Proportion of tasks resolved without expert help

As we mention in the document, this is challenging to do well and automatically. The easiest way to do it manually would probably be to allow users to log telemetry indicating that an expert helped them on their current task or on a particular earlier task.

Non-FM work productivity

Measuring this rigorously and in general would be very challenging. Easier would be to let users self-report their productivity gain/loss, periodically or at the end of a trial, on something like a numeric or Likert scale.

Times user accesses VDE or VCA documentation
...
If failure signal before documentation access, does it get resolved after documentation access

Our ability to do this depends on the source of the documentation, and how much we're willing to restrict users' access to it. One possibility, if documentation existed exclusively as webpages, would be to set up some in-browser telemetry collection that tracked e.g. focusing in/out of a page, and measure documentation access by the frequency and duration of focus. This may require recording telemetry via a server, as opposed to on-disk.

@bcpierce00
Collaborator

bcpierce00 commented Nov 18, 2024 via email

@samcowger
Contributor

@kschleisman - if you get the chance, I'd appreciate it if you could take a quick look at my above thoughts on telemetry. I would be particularly interested to know if it seems compatible with your understanding of our user trials, or if I'm making any faulty assumptions about their content or format.

@bcpierce00
Collaborator

All this sounds reasonable.

Some further questions:

  • If we do some work on integrating the documentation pages directly into the VSCode interface, then we can gather a LOT of information about things like where people have questions (documentation hotspots), how often referring to the documentation gets people unstuck, etc.
  • Tight integration with the documentation is desirable anyway because we've talked about using the tutorial as a rich (perhaps even primary) and relatively stable source of examples on which to measure users' performance using VERSE.

@kschleisman

Thanks for sharing this, Sam! It looks good to me. I strongly agree with Benjamin's comment above about integrating documentation into the interface. I also think we should avoid collecting any PII (name, level of expertise, etc.) in the logs. We are in the middle of conducting 1:1 interviews with LM Aero developers who will likely/hopefully be our user base in Phase III, and we have already collected demographic data on them via survey. So as long as we link those users to their unique ID/username in VERSE, we are all set. Same plan for the Lynx folks. We are aiming to identify a small, consistent set of users in Phases II and III and track their usage of VERSE over a longer period, so we should know who the users are before logging telemetry data.

@thatplguy
Collaborator Author

This looks great @samcowger, thanks for putting it together.

A few thoughts on your writeup and subsequent comments:

  1. Collecting telemetry from the docs is a good idea, but I would suggest deferring that work until after you finish the telemetry you proposed in this writeup.
  2. I agree with the metrics you called out as difficult to measure. We can skip those for now, with the exception of "assurance created per time spent" – for that, let's pick a simple measure of level of assurance, e.g. "verified / tested / neither". We can consider extending this later to try to characterize the scope of the specification, e.g. memory safety, partial correctness, etc.
