[FEATURE] Collect user telemetry to support QE/EC metrics #125
Comments
For reference, it seems there is a more up-to-date version of these requirements here. I'll probably use that to inform a plan for telemetry collection.
Here is a sketch of what kind of telemetry I think we want to collect, and how we might collect it:

What to Collect

We will solicit information on each user and their background, including:
- A name, and/or a private unique identifier, depending on privacy need
- Area(s) of expertise, chosen from a set of reasonably granular options that are tailored to the user base under evaluation
- Level of expertise per area - something like a numeric or Likert scale should suffice
We will record timestamped language server lifecycle events, including:
- Startup
- Configuration/reconfiguration of preferences
- Shutdown/disconnect
We will record timestamped CN interactions, including:
- What CN functionality is invoked
  - e.g. well-formedness, verification, test generation, test execution
- Where it's invoked
  - e.g. the file in which it's invoked, and, if applicable, the name of the function in that file
- How it's invoked
  - e.g. via a menu item, via a code lens, on document save, at the command line
- The outcome of the interaction
  - e.g. success, partial success/failure (k out of n tests passed), failure due to user error, failure due to tool error
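To make the event schema concrete, here is a rough sketch of how these events might be represented as OCaml types in the telemetry library. All type and constructor names are illustrative placeholders, not a settled design:

```ocaml
(* Illustrative sketch only; type and constructor names are placeholders. *)

type cn_functionality =
  | Wellformedness
  | Verification
  | TestGeneration
  | TestExecution

type invocation_method =
  | MenuItem
  | CodeLens
  | OnSave
  | CommandLine

type outcome =
  | Success
  | PartialSuccess of { passed : int; total : int }  (* k out of n tests passed *)
  | UserError of string
  | ToolError of string

type event_body =
  | ServerStartup
  | ServerConfigured                  (* configuration/reconfiguration of preferences *)
  | ServerShutdown
  | CnInteraction of {
      functionality : cn_functionality;
      file : string;                  (* where it was invoked *)
      function_name : string option;  (* if applicable *)
      how : invocation_method;
      outcome : outcome;
    }

type event = {
  timestamp : float;    (* e.g. Unix.gettimeofday () *)
  session_id : string;  (* session identifiers are discussed just below *)
  body : event_body;
}
```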
We will generate and record an identifier for the session in which a user works with our tools. The logic for determining the boundaries of a session seems orthogonal to the practice of assigning an identifier to a session. I'll assume, wlog, that a session begins when a VSCode editor is opened or when a user first executes CN at the command line, and ends when the VSCode window is closed, or after some set amount of time (say, until midnight).

We will record, or be able to derive after the fact, the particular task associated with a given action. The logic for determining the scope of a task will inform whether we record it or derive it after the fact. For instance, if a task is "verify function foo in file bar.c," we could derive what actions belong to that task after the fact by finding the first interaction and the last (successful) verification of foo, and declaring that all interactions within those bounds belong to that task.

This telemetry should be sufficient to meet all "Usability" metrics, as defined in the document I linked above, except for those metrics listed below. The "FM - Tooling" metrics all seem to be inherent properties of CN itself, rather than of user interactions with it, so I do not intend to address them with the facilities I describe here.

How to Collect

Telemetry collection can be implemented as relatively simple structured logging. We'll implement it as a standalone OCaml package, and both CN itself and the CN language server will use this library.

If on-disk storage proves flexible enough for our purposes, the tool will maintain its telemetry state on disk, including a user's profile information, the events it has logged, and its current session. This will support interleaving events that come from VSCode, the command line, and any other sources. If it doesn't - for instance, if it proves difficult to export data after the fact, or if multiple distinct machines need to target the same session or task - an alternative approach would be to stand up a simple webserver that can collect and store telemetry. To keep the choice of recording approach flexible, the tool's implementation will take care to maintain an abstraction over serialization behavior.

It would be beneficial to record events and other data in a human-readable format like JSON - it's simple to implement, it doesn't limit the universe of potential consumers of the data, and it seems valuable to offer the user a chance to view (and possibly edit) the information these tools collect. If it proves necessary, we could choose a more space-efficient representation in addition to, or instead of, JSON.

The tool will offer facilities for:
- Constructing and recording individual telemetry events. The schema of these events will be kept maximally flexible to support the potential future addition of event types.
- Recording and possibly updating persistent user profile information. See above for the sorts of information we aim to collect. The tool will treat this data as less mutable than event telemetry.
- Packaging a set of telemetry events for easy export.
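As an illustration of those facilities, the library's interface might look roughly like the following. This is a sketch under the on-disk JSON assumption above, reusing the event type sketched earlier; every name is provisional:

```ocaml
(* Hypothetical interface sketch; [event] is the record type sketched earlier,
   and all names here are provisional. *)
module type TELEMETRY = sig
  type t
  (** Handle to telemetry state (user profile, logged events, current session),
      stored on disk in the simplest design. *)

  val load : dir:string -> t
  (** Open or initialize telemetry state rooted at [dir]. *)

  val set_profile : t -> id:string -> expertise:(string * int) list -> unit
  (** Record or update persistent user profile information; treated as more
      stable than event telemetry. *)

  val record : t -> event -> unit
  (** Construct and append a single telemetry event; the schema stays loose so
      new event types can be added later. *)

  val export : t -> ?since:float -> unit -> string
  (** Package logged events (e.g. as a JSON document) for export. *)
end
```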
Recording telemetry based on user interactions with VSCode is straightforward. The language server protocol already defines messages for most events we'd find meaningful, so we can include telemetry-recording functionality in the existing message-handling logic. That protocol is also extensible, so we can define and send custom messages for any other events we want to record.
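For instance (purely illustrative, and not tied to any particular LSP framework or to the server's actual internals), a verification request handler might bracket the existing logic with a telemetry call. Here [Telemetry] is assumed to implement the interface sketched above, and [run_verification] stands in for the server's existing CN verification logic:

```ocaml
(* Purely illustrative; [Telemetry] implements the sketched interface, and
   [run_verification] is a hypothetical stand-in for existing server logic. *)
let handle_verify telemetry ~session_id ~file ~function_name
    ~(run_verification : file:string -> function_name:string -> (unit, string) result) =
  let started = Unix.gettimeofday () in
  let result = run_verification ~file ~function_name in
  Telemetry.record telemetry
    { timestamp = started;
      session_id;
      body =
        CnInteraction
          { functionality = Verification;
            file;
            function_name = Some function_name;
            how = CodeLens;
            outcome =
              (match result with
               | Ok () -> Success
               | Error msg -> ToolError msg) } };
  result
```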
Recording telemetry from CN command-line interactions ought to be even simpler, as CN internals have access to far more information than CN clients.

Recording telemetry in a "VCA" environment ought to be simple enough. The possible pain points are determining whether a given VCA session ought to be the same as a given user-machine session, and likewise for tasks. This could be made easier by switching to a server-based collection approach, rather than on-disk serialization, as described above.

Other Metrics

Some metrics remain unmeasurable, or less measurable, with this telemetry approach.

Frequency of VDE generated tests or CN annotations in committed code base

This is not easily measurable with real-time telemetry; it is more easily measured with a standalone tool that can analyze an entire codebase snapshot and calculate the proportions of interest. CN specs are easily recognizable to a simple parser, so calculating their frequency amounts to simple math. Generated tests may be less distinguishable from preexisting or user-defined tests, but can certainly be made distinguishable if necessary.
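To make the standalone-analysis idea concrete, here is a minimal sketch in OCaml, assuming CN specs appear in /*@ ... @*/ magic comments; a real tool would use a proper parser rather than string matching:

```ocaml
(* Rough sketch: count lines in a file that open a CN spec, assuming specs are
   written in /*@ ... @*/ magic comments. String matching only; no parsing. *)

let contains_substring s sub =
  let n = String.length s and m = String.length sub in
  let rec go i = i + m <= n && (String.sub s i m = sub || go (i + 1)) in
  go 0

let count_cn_spec_openers path =
  let ic = open_in path in
  let count = ref 0 in
  (try
     while true do
       if contains_substring (input_line ic) "/*@" then incr count
     done
   with End_of_file -> ());
  close_in ic;
  !count
```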
Assurance created per time spent in VDE by [USER]
...
Assurance created per time spent in VDE by [EXPERTISE]

This relies on a coherent quantification of assurance, a tenuous notion. Assuming one exists, though, measuring it should be simple. A given task will admit a "maximum" amount of assurance, and a user will achieve assurance whose quantity lies between zero and that maximum. We can determine the maximum as a static property of the task, and we should already be collecting enough telemetry to determine a user's achieved assurance, either in real time or via post-hoc analysis.
Proportion of tasks resolved without expert help

As we mention in the document, this is challenging to do both well and automatically. The easiest way to do it manually would probably be to allow users to log telemetry saying that an expert helped them on their current task, or on some earlier task.
Non-FM work productivity

Measuring this rigorously and in general would be very challenging. An easier approach would be to let users self-report their productivity gain/loss, periodically or at the end of a trial, on something like a numeric or Likert scale.
Times user accesses VDE or VCA documentation
...
If failure signal before documentation access, does it get resolved after documentation access

Our ability to do this depends on the source of the documentation, and on how much we're willing to restrict users' access to it. One possibility, if documentation existed exclusively as webpages, would be to set up some in-browser telemetry collection that tracked e.g. focusing in/out of a page, and measure documentation access by the frequency and duration of focus. This may require recording telemetry via a server, as opposed to on disk.
Not sure you two will have seen this...
@kschleisman - if you get the chance, I'd appreciate it if you could take a quick look at my above thoughts on telemetry. I would be particularly interested to know if it seems compatible with your understanding of our user trials, or if I'm making any faulty assumptions about their content or format.
All this sounds reasonable. Some further questions:
Thanks for sharing this, Sam! It looks good to me. I strongly agree with Benjamin's comment above about integrating documentation into the interface. I also think we should avoid collecting any PII in the logs, like name, level of expertise, etc. We are in the middle of conducting 1:1 interviews with LM Aero developers who will likely/hopefully be our user base in Phase III, and we have already collected demographic data on them via survey. So as long as we link those users to their unique ID/username in VERSE, we are all set. The same plan applies to the Lynx folks. We are aiming to identify a small and consistent set of users in Phases II and III and track their usage of VERSE over a longer period of time, so we should know who the users are ahead of logging telemetry data.
This looks great @samcowger, thanks for putting it together. A few thoughts on your writeup and subsequent comments:
Summary
Instrument CN, the CN error state output, and/or the VSCode plugin to support QE/EC quantitative metrics.
Feature
Identify where/how to modify our tools (or build new tools if necessary, e.g. an HTTP server) to collect the metrics identified in Tables 3-2, 3-3, and 3-4 in this doc:
https://docs.google.com/document/d/17n3E-tPR5cUZr_YSGspJlL3DrP7bGlpkglyrBWSgQJ4/edit
Also identify metrics that are difficult/impossible to measure. We can push back against collecting metrics that are too difficult to collect.
Acceptance Criteria
Running VERSE tools produces a file on the local file system containing user telemetry data.
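As a rough illustration of what could satisfy this criterion, the telemetry library might append one JSON object per event to a local file. This is a sketch only; the field names, file handling, and use of the yojson library are assumptions rather than settled choices:

```ocaml
(* Hypothetical sketch: append one JSON object per event to a local log file.
   Assumes the yojson and unix libraries; path and fields are placeholders. *)
let append_event ~path ~session_id ~kind =
  let json =
    `Assoc
      [ ("timestamp", `Float (Unix.gettimeofday ()));
        ("session", `String session_id);
        ("kind", `String kind) ]
  in
  let oc = open_out_gen [ Open_append; Open_creat ] 0o600 path in
  output_string oc (Yojson.Safe.to_string json ^ "\n");
  close_out oc
```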
Do