Better abstraction for accessing needs data (to improve scaling) #1264

chrisjsewell · 2024-09-02T08:36:50Z

Currently, all needs are stored and accessed from a simple in-memory dict[str, dict] on the sphinx build environment, mapping the need ID to the need data item.

However, when considering large projects with ~100,000 needs or more this starts to become problematic for:

the memory consumption of sphinx; especially when considering multi-processing, if every process requires a copy of this dict
filtering of needs; having to run a Python for loop over all needs, then check on each data item, will be noticeably slow

Ideally we want an abstraction for accessing the needs, particularly in the write/analysis phase, such that:

the abstraction hides to the user where/how the actual data is stored
the abstraction offers additional methods for performant filtering on common queries (e.g. finding all needs with a certain status)

Some work was started on this in #1255, however, one big pain point is that currently the needs are "exposed" to users in a number of different ways (see below), principally:

Directly, the stored dict[str, dict] (after processing)
the stored dict[str, dict] converted to a list[dict]
the stored dict[str, dict] converted to a list[dict], but also with all need parts "expanded" and appended to the list

so somehow, we need to account for these representations also, without back-breaking changes

This also relates to: #1238 and #1217, asking to expose needs in more places, and #1219 requesting better scaling filtering

Needs access life-cycle

reading/collection phase
- retrieve cached (unprocessed) needs, or instantiate
- add any external needs
- before document read: remove all needs for that document
- during document read: add needs from directives
- after document read: update all needs for that document (with data such as sections)
- cache (unprocessed) needs
after caching (unprocessed) needs and before write/analysis phase
- process needs (resolving dynamic values, computing backlinks etc)
write/analysis phase
- retrieve views of the processed needs (to be filtered, read, etc)

Places where needs are exposed to the user

In configuration:

needs_warnings (when string): list of needs, no parts (to process_warnings)
needs_builder_filter: list of needs, no parts (to filter_needs)

In roles:

need_count:
- content: list of needs + parts (to filter_needs)
need_func:
- replacement of dynamic functions: map of id to need, no parts

In directives:

need directives:
- resolution of dynamic values containing dynamic functions: map of id to unprocessed need, no parts
- replacement of dynamic functions in content and style option: map of id to need, no parts
needbar:
- content lines: list of needs + parts (to filter_needs)
needextend:
- argument: list of unmodified needs, no parts (to filter_needs)
needextract:
- content / filter-func: list of needs + parts (to process_filters)
needfilter:
- content / filter-func: list of needs + parts (to process_filters)
needflow:
- content / filter-func: list of needs + parts (to process_filters)
- highlight: list of needs, no parts (to filter_single_need)
  - NOTE: missed for graphviz (fixed in 🐛 Pass needs to highlight filter of graphviz needflow #1274)
needgantt:
- content / filter-func: list of needs + parts (to process_filters)
needlist:
- content / filter-func: list of needs + parts (to process_filters)
needpie:
- content lines / filter-func: list of needs + parts (to filter_needs)
needsequence:
- filter: list of needs, no parts (to filter_single_need)
needtable:
- content / filter-func: list of needs + parts (to process_filters)
- replacement of dynamic functions in style_row option: map of id to need, no parts
needuml:
- content: map of id to need, no parts (to jinja context)
- filter jinja function: list of needs, no parts (to filter_needs)

The text was updated successfully, but these errors were encountered:

Make it clearer where we are using an list of the "expanded" needs+parts. As discussed in #1264, there are a number of different representations of the needs, and so this makes it clearer which one a variable is.

This commit addresses issues in performance scalability of needs filtering: - In a known user project, with ~40,000 needs, there are 5316 individual filter calls, across directives such as `needarch`, `needtable`, `needpie`, etc. Each takes on average 0.33 seconds, cumulative totalling **1751 seconds** - Applying this PR reduces the average time to 0.033 seconds, cumulatively totalling **181 seconds** ---- The main issue with filtering (a.k.a. querying) needs, is that it requires looping over every need and executing a Python evaluation of the filter string (e.g. `id == "xxx"`), which scales with `O(N)` time. To reduce this to `O(1)` time, for common patterns, we look to do two things: 1. More tightly control access to needs data across its lifecycle (see #1264): By making needs data strictly immutable/read-only during the write/analysis phase (after it has been fully collected and post-processed), and moving access behind an abstract interface (as opposed to a standard dictionary), we can perform indexing on standard need fields, making value lookups `O(1)` 2. Analysing the filter string for common patterns (by parsing/analysing the Python AST), we can then utilise *index lookups* rather than *row scans* in these cases (akin to [SQL query plans](https://sqlite.org/eqp.html)). --- One complication in creating the indexes and abstract interfaces, is that within the code base there are essentially two ways of accessing needs data: - As a mapping of need ID to need data (now `NeedsView`) - As a list of both needs and "expanded" parts (now `NeedsAndPartsListView`) Particularly when it comes to e.g. `id == "xxx"`, this then has different meanings, as the former is only filtering for need IDs and the latter is filtering for both Need IDs and part IDs --- It is of note that this will be breaking for any projects that currently attempts to mutate needs data within the write phase. This has been mitigated by the addition of two sphinx events, which give more *precise* control for this use case: - `needs-before-post-processing`: callbacks `func(app, needs)` are called just before the needs are post-processed (e.g. processing dynamic functions and back links) - `needs-before-sealing`: callbacks `func(app, needs)` just after post-processing, before the needs are changed to read-only Additionally, `env.needs_all_needs` has been replaced with `env._needs_all_needs`, since this "raw" data-structure should not be accessed directly. The `get_needs_view` function has been added to the `sphinx_needs.api` to mitigate this, and it should perhaps be made clearer that users should NOT be accessing any `sphinx_needs` API outside this module.

chrisjsewell · 2024-10-08T09:49:25Z

closed in #1281

chrisjsewell self-assigned this Sep 2, 2024

chrisjsewell added the enhancement label Sep 2, 2024

chrisjsewell mentioned this issue Sep 4, 2024

🔧 Introduce NeedsPartsView type #1275

Closed

This was referenced Sep 4, 2024

🔧 split filter_needs func by needs type #1276

Merged

✨ Add get_needs_view function to public API #1277

Closed

chrisjsewell mentioned this issue Sep 5, 2024

♻️ Replace need dicts/lists with views (with fast filtering) #1281

Merged

chrisjsewell closed this as completed Oct 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better abstraction for accessing needs data (to improve scaling) #1264

Better abstraction for accessing needs data (to improve scaling) #1264

chrisjsewell commented Sep 2, 2024 •

edited

Loading

chrisjsewell commented Oct 8, 2024

Better abstraction for accessing needs data (to improve scaling) #1264

Better abstraction for accessing needs data (to improve scaling) #1264

Comments

chrisjsewell commented Sep 2, 2024 • edited Loading

Needs access life-cycle

Places where needs are exposed to the user

chrisjsewell commented Oct 8, 2024

chrisjsewell commented Sep 2, 2024 •

edited

Loading