Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better abstraction for accessing needs data (to improve scaling) #1264

Closed
chrisjsewell opened this issue Sep 2, 2024 · 1 comment
Closed
Assignees

Comments

@chrisjsewell
Copy link
Member

chrisjsewell commented Sep 2, 2024

Currently, all needs are stored and accessed from a simple in-memory dict[str, dict] on the sphinx build environment, mapping the need ID to the need data item.

However, when considering large projects with ~100,000 needs or more this starts to become problematic for:

  1. the memory consumption of sphinx; especially when considering multi-processing, if every process requires a copy of this dict
  2. filtering of needs; having to run a Python for loop over all needs, then check on each data item, will be noticeably slow

Ideally we want an abstraction for accessing the needs, particularly in the write/analysis phase, such that:

  1. the abstraction hides to the user where/how the actual data is stored
  2. the abstraction offers additional methods for performant filtering on common queries (e.g. finding all needs with a certain status)

Some work was started on this in #1255, however, one big pain point is that currently the needs are "exposed" to users in a number of different ways (see below), principally:

  1. Directly, the stored dict[str, dict] (after processing)
  2. the stored dict[str, dict] converted to a list[dict]
  3. the stored dict[str, dict] converted to a list[dict], but also with all need parts "expanded" and appended to the list

so somehow, we need to account for these representations also, without back-breaking changes

This also relates to: #1238 and #1217, asking to expose needs in more places, and #1219 requesting better scaling filtering


Needs access life-cycle

  • reading/collection phase

    • retrieve cached (unprocessed) needs, or instantiate
    • add any external needs
    • before document read: remove all needs for that document
    • during document read: add needs from directives
    • after document read: update all needs for that document (with data such as sections)
    • cache (unprocessed) needs
  • after caching (unprocessed) needs and before write/analysis phase

    • process needs (resolving dynamic values, computing backlinks etc)
  • write/analysis phase

    • retrieve views of the processed needs (to be filtered, read, etc)

Places where needs are exposed to the user

In configuration:

  • needs_warnings (when string): list of needs, no parts (to process_warnings)
  • needs_builder_filter: list of needs, no parts (to filter_needs)

In roles:

  • need_count:
    • content: list of needs + parts (to filter_needs)
  • need_func:
    • replacement of dynamic functions: map of id to need, no parts

In directives:

  • need directives:
    • resolution of dynamic values containing dynamic functions: map of id to unprocessed need, no parts
    • replacement of dynamic functions in content and style option: map of id to need, no parts
  • needbar:
    • content lines: list of needs + parts (to filter_needs)
  • needextend:
    • argument: list of unmodified needs, no parts (to filter_needs)
  • needextract:
    • content / filter-func: list of needs + parts (to process_filters)
  • needfilter:
    • content / filter-func: list of needs + parts (to process_filters)
  • needflow:
  • needgantt:
    • content / filter-func: list of needs + parts (to process_filters)
  • needlist:
    • content / filter-func: list of needs + parts (to process_filters)
  • needpie:
    • content lines / filter-func: list of needs + parts (to filter_needs)
  • needsequence:
    • filter: list of needs, no parts (to filter_single_need)
  • needtable:
    • content / filter-func: list of needs + parts (to process_filters)
    • replacement of dynamic functions in style_row option: map of id to need, no parts
  • needuml:
    • content: map of id to need, no parts (to jinja context)
    • filter jinja function: list of needs, no parts (to filter_needs)
@chrisjsewell chrisjsewell self-assigned this Sep 2, 2024
chrisjsewell added a commit that referenced this issue Sep 4, 2024
Make it clearer where we are using an list of the "expanded" needs+parts.

As discussed in #1264, there are a number of different representations of the needs,
and so this makes it clearer which one a variable is.
chrisjsewell added a commit that referenced this issue Sep 4, 2024
Make it clearer where we are using an list of the "expanded" needs+parts.

As discussed in #1264, there are a number of different representations of the needs,
and so this makes it clearer which one a variable is.
chrisjsewell added a commit that referenced this issue Sep 4, 2024
Make it clearer where we are using an list of the "expanded" needs+parts.

As discussed in #1264, there are a number of different representations of the needs,
and so this makes it clearer which one a variable is.
chrisjsewell added a commit that referenced this issue Sep 5, 2024
Make it clearer where we are using an list of the "expanded" needs+parts.

As discussed in #1264, there are a number of different representations of the needs,
and so this makes it clearer which one a variable is.
chrisjsewell added a commit that referenced this issue Sep 6, 2024
Make it clearer where we are using an list of the "expanded" needs+parts.

As discussed in #1264, there are a number of different representations of the needs,
and so this makes it clearer which one a variable is.
chrisjsewell added a commit that referenced this issue Sep 6, 2024
Make it clearer where we are using an list of the "expanded" needs+parts.

As discussed in #1264, there are a number of different representations of the needs,
and so this makes it clearer which one a variable is.
chrisjsewell added a commit that referenced this issue Oct 1, 2024
This commit addresses issues in performance scalability of needs filtering:

- In a known user project, with ~40,000 needs, there are 5316 individual filter calls, across directives such as `needarch`, `needtable`, `needpie`, etc. Each takes on average 0.33 seconds, cumulative totalling **1751 seconds**
- Applying this PR reduces the average time to 0.033 seconds, cumulatively totalling **181 seconds**

----

The main issue with filtering (a.k.a. querying) needs, is that it requires looping over every need and executing a Python evaluation of the filter string (e.g. `id == "xxx"`), which scales with `O(N)` time.

To reduce this to `O(1)` time, for common patterns, we look to do two things:

1. More tightly control access to needs data across its lifecycle (see #1264): 
   By making needs data strictly immutable/read-only during the write/analysis phase (after it has been fully collected and post-processed), and moving access behind an abstract interface (as opposed to a standard dictionary),
   we can perform indexing on standard need fields, making value lookups `O(1)`

2. Analysing the filter string for common patterns (by parsing/analysing the Python AST), we can then utilise *index lookups* rather than *row scans* in these cases (akin to [SQL query plans](https://sqlite.org/eqp.html)).

---

One complication in creating the indexes and abstract interfaces, is that within the code base there are essentially two ways of accessing needs data:

- As a mapping of need ID to need data (now `NeedsView`)
- As a list of both needs and "expanded" parts (now `NeedsAndPartsListView`)

Particularly when it comes to e.g. `id == "xxx"`, this then has different meanings, as the former is only filtering for need IDs and the latter is filtering for both Need IDs and part IDs

---

It is of note that this will be breaking for any projects that currently attempts to mutate needs data within the write phase.
This has been mitigated by the addition of two sphinx events, which give more *precise* control for this use case:

- `needs-before-post-processing`: callbacks `func(app, needs)` are called just before the needs are post-processed (e.g. processing dynamic functions and back links)
- `needs-before-sealing`: callbacks `func(app, needs)` just after post-processing, before the needs are changed to read-only

Additionally, `env.needs_all_needs` has been replaced with `env._needs_all_needs`, since this "raw" data-structure should not be accessed directly.

The `get_needs_view` function has been added to the `sphinx_needs.api` to mitigate this, and it should perhaps be made clearer that users should NOT be accessing any `sphinx_needs` API outside this module.
@chrisjsewell
Copy link
Member Author

closed in #1281

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant