-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better abstraction for accessing needs data (to improve scaling) #1264
Labels
Comments
chrisjsewell
added a commit
that referenced
this issue
Sep 4, 2024
Make it clearer where we are using an list of the "expanded" needs+parts. As discussed in #1264, there are a number of different representations of the needs, and so this makes it clearer which one a variable is.
chrisjsewell
added a commit
that referenced
this issue
Sep 4, 2024
Make it clearer where we are using an list of the "expanded" needs+parts. As discussed in #1264, there are a number of different representations of the needs, and so this makes it clearer which one a variable is.
chrisjsewell
added a commit
that referenced
this issue
Sep 4, 2024
Make it clearer where we are using an list of the "expanded" needs+parts. As discussed in #1264, there are a number of different representations of the needs, and so this makes it clearer which one a variable is.
This was referenced Sep 4, 2024
chrisjsewell
added a commit
that referenced
this issue
Sep 5, 2024
Make it clearer where we are using an list of the "expanded" needs+parts. As discussed in #1264, there are a number of different representations of the needs, and so this makes it clearer which one a variable is.
chrisjsewell
added a commit
that referenced
this issue
Sep 6, 2024
Make it clearer where we are using an list of the "expanded" needs+parts. As discussed in #1264, there are a number of different representations of the needs, and so this makes it clearer which one a variable is.
chrisjsewell
added a commit
that referenced
this issue
Sep 6, 2024
Make it clearer where we are using an list of the "expanded" needs+parts. As discussed in #1264, there are a number of different representations of the needs, and so this makes it clearer which one a variable is.
chrisjsewell
added a commit
that referenced
this issue
Oct 1, 2024
This commit addresses issues in performance scalability of needs filtering: - In a known user project, with ~40,000 needs, there are 5316 individual filter calls, across directives such as `needarch`, `needtable`, `needpie`, etc. Each takes on average 0.33 seconds, cumulative totalling **1751 seconds** - Applying this PR reduces the average time to 0.033 seconds, cumulatively totalling **181 seconds** ---- The main issue with filtering (a.k.a. querying) needs, is that it requires looping over every need and executing a Python evaluation of the filter string (e.g. `id == "xxx"`), which scales with `O(N)` time. To reduce this to `O(1)` time, for common patterns, we look to do two things: 1. More tightly control access to needs data across its lifecycle (see #1264): By making needs data strictly immutable/read-only during the write/analysis phase (after it has been fully collected and post-processed), and moving access behind an abstract interface (as opposed to a standard dictionary), we can perform indexing on standard need fields, making value lookups `O(1)` 2. Analysing the filter string for common patterns (by parsing/analysing the Python AST), we can then utilise *index lookups* rather than *row scans* in these cases (akin to [SQL query plans](https://sqlite.org/eqp.html)). --- One complication in creating the indexes and abstract interfaces, is that within the code base there are essentially two ways of accessing needs data: - As a mapping of need ID to need data (now `NeedsView`) - As a list of both needs and "expanded" parts (now `NeedsAndPartsListView`) Particularly when it comes to e.g. `id == "xxx"`, this then has different meanings, as the former is only filtering for need IDs and the latter is filtering for both Need IDs and part IDs --- It is of note that this will be breaking for any projects that currently attempts to mutate needs data within the write phase. This has been mitigated by the addition of two sphinx events, which give more *precise* control for this use case: - `needs-before-post-processing`: callbacks `func(app, needs)` are called just before the needs are post-processed (e.g. processing dynamic functions and back links) - `needs-before-sealing`: callbacks `func(app, needs)` just after post-processing, before the needs are changed to read-only Additionally, `env.needs_all_needs` has been replaced with `env._needs_all_needs`, since this "raw" data-structure should not be accessed directly. The `get_needs_view` function has been added to the `sphinx_needs.api` to mitigate this, and it should perhaps be made clearer that users should NOT be accessing any `sphinx_needs` API outside this module.
closed in #1281 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Currently, all needs are stored and accessed from a simple in-memory
dict[str, dict]
on the sphinx build environment, mapping the need ID to the need data item.However, when considering large projects with ~100,000 needs or more this starts to become problematic for:
for
loop over all needs, then check on each data item, will be noticeably slowIdeally we want an abstraction for accessing the needs, particularly in the write/analysis phase, such that:
Some work was started on this in #1255, however, one big pain point is that currently the needs are "exposed" to users in a number of different ways (see below), principally:
dict[str, dict]
(after processing)dict[str, dict]
converted to alist[dict]
dict[str, dict]
converted to alist[dict]
, but also with all need parts "expanded" and appended to the listso somehow, we need to account for these representations also, without back-breaking changes
This also relates to: #1238 and #1217, asking to expose needs in more places, and #1219 requesting better scaling filtering
Needs access life-cycle
reading/collection phase
sections
)after caching (unprocessed) needs and before write/analysis phase
write/analysis phase
Places where needs are exposed to the user
In configuration:
needs_warnings
(when string): list of needs, no parts (toprocess_warnings
)needs_builder_filter
: list of needs, no parts (tofilter_needs
)In roles:
need_count
:filter_needs
)need_func
:In directives:
style
option: map of id to need, no partsneedbar
:filter_needs
)needextend
:filter_needs
)needextract
:filter-func
: list of needs + parts (toprocess_filters
)needfilter
:filter-func
: list of needs + parts (toprocess_filters
)needflow
:filter-func
: list of needs + parts (toprocess_filters
)highlight
: list of needs, no parts (tofilter_single_need
)needs
tohighlight
filter ofgraphviz
needflow
#1274)needgantt
:filter-func
: list of needs + parts (toprocess_filters
)needlist
:filter-func
: list of needs + parts (toprocess_filters
)needpie
:filter-func
: list of needs + parts (tofilter_needs
)needsequence
:filter
: list of needs, no parts (tofilter_single_need
)needtable
:filter-func
: list of needs + parts (toprocess_filters
)style_row
option: map of id to need, no partsneeduml
:filter
jinja function: list of needs, no parts (tofilter_needs
)The text was updated successfully, but these errors were encountered: