
Use/remove SqlaGroup.add_nodes/remove_nodes's skip_orm flag? #5453

Open
chrisjsewell opened this issue Mar 16, 2022 · 7 comments · May be fixed by #6720
Labels: topic/orm, topic/storage, type/performance (Issue related to how quickly AiiDA works)

@chrisjsewell
Member

chrisjsewell commented Mar 16, 2022

The skip_orm flag was added to these methods in b40b2ca and bced84e

However, it is not currently used anywhere in the code base:

  • For add_nodes, it was effectively superseded by StorageBackend.bulk_insert, for importing archives
  • For remove_nodes, I'm not sure if it was ever used (@sphuber?)

One place it might be used is in aiida/cmdline/commands/cmd_group.py::group_remove_nodes.
I would note, though, that the performance of bulk insertions/deletions on the session has been improved in sqlalchemy>=1.4, so it might be worth double-checking any perceived performance improvements.
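For context, what a skip_orm-style path does is bypass the ORM unit of work and issue a single Core INSERT against the association table. Below is a minimal, self-contained sketch of that pattern; it is not AiiDA code — the table and column names merely mimic the group-node relationship table, and an in-memory SQLite database stands in for PostgreSQL:

```python
# Illustrative sketch only: a skip_orm-style bulk insert with SQLAlchemy Core.
# The table/column names are hypothetical, mirroring the shape of the
# group-node relationship table; SQLite stands in for PostgreSQL.
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, func, insert, select

engine = create_engine('sqlite://')
metadata = MetaData()
group_nodes = Table(
    'group_nodes',
    metadata,
    Column('dbgroup_id', Integer, primary_key=True),
    Column('dbnode_id', Integer, primary_key=True),
)
metadata.create_all(engine)

rows = [{'dbgroup_id': 1, 'dbnode_id': n} for n in range(1000)]
with engine.begin() as conn:
    # One INSERT statement for all 1000 rows, instead of 1000 ORM appends,
    # each followed by a flush.
    conn.execute(insert(group_nodes), rows)
    count = conn.execute(select(func.count()).select_from(group_nodes)).scalar()

print(count)  # 1000
```

The single `executemany`-style `conn.execute(insert(...), rows)` call is what saves the per-row round trips that the ORM append loop incurs.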

@chrisjsewell chrisjsewell added topic/orm type/performance Issue related to how quickly AiiDA works labels Mar 16, 2022
@sphuber
Contributor

sphuber commented Mar 21, 2022

We added this back in the day because the performance of adding nodes to groups was really bad. The raw SQL will often take order(s) of magnitude less time compared to the call through our ORM. With the skip_orm flag this difference was significantly reduced, but we weren't sure it would be bug-free, so we decided not to expose it in the front-end ORM nor use it ourselves by default. I have used it in my scripts for the SDB and 3DD databases, which often need to add tens of thousands of nodes to groups; without this flag that would take ages. I can run a quick test on v2.0, but I remember recently adding nodes without it and still being surprised by how long it took.

@sphuber
Contributor

sphuber commented Mar 21, 2022

Yep, still a problem, skip_orm=True is 50 times faster.

In [1]: group = Group('test')

In [2]: group.store()
Out[2]: <Group: 'test' [type core], of user mail@sphuber.net>

In [3]: nodes = [Data().store() for _ in range(1000)]

In [4]: backend_nodes = [n.backend_entity for n in nodes]

In [5]: %timeit group.add_nodes(nodes)
1.94 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: group.clear()

In [7]: %timeit group.backend_entity.add_nodes(backend_nodes)
2.01 s ± 49.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: group.clear()

In [9]: %timeit group.backend_entity.add_nodes(backend_nodes, skip_orm=True)
37.9 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@chrisjsewell
Member Author

Ok cheers. I think there may be a better way to do this, though, in order to expose it on the front end. Will look into it.

@GeigerJ2
Contributor

We'll have a talented Master's student who'll be looking into improving bulk SQL operations, such as what is outlined here. Have there, by any chance, been any developments on this topic in recent years?

@sphuber
Contributor

sphuber commented Oct 24, 2024

Not really; I think the state of this conversation is still pretty accurate.

@rabbull
Contributor

rabbull commented Jan 21, 2025

TL;DR: Appending the nodes one by one in SqlaGroup.add_nodes is the main reason the ORM mode is less efficient. Flushing every time a node is appended also contributes a little to the performance problem, but is acceptable.

I analyzed the implementation of the add_nodes method of Group. The reason the ORM approach is far less efficient than the core-API path seems to be that the ORM implementation appends the incoming nodes to dbnodes one by one, which can lead to len(nodes)-1 extra database interactions compared to the core API implementation. I also suspect the repeated session.flush() calls contribute to the problem.

To confirm these hypotheses, I modified the function slightly and performed a benchmark.

Two optional parameters are added to SqlaGroup.add_nodes:

  • skip_flush: Controls whether session.flush() is called after each node addition, to measure the overhead of frequent flushes.
  • use_extend: Enables using dbnodes.extend(...) for batch addition instead of iteratively calling dbnodes.append(...).

The modified method can be found here for reference:

    def add_nodes(self, nodes, **kwargs):
        """Add a node or a set of nodes to the group.

        :note: all the nodes *and* the group itself have to be stored.

        :param nodes: a list of `BackendNode` instances to be added to this group

        :param kwargs:
            skip_orm: When the flag is on, the SQLA ORM is skipped and SQLA is used
                to create a direct SQL INSERT statement to the group-node relationship
                table (to improve speed).
            skip_flush: When the flag is on, ``session.flush()`` is not called after
                each node addition, to measure the overhead of frequent flushes.
            use_extend: When the flag is on, ``dbnodes.extend(...)`` is used for batch
                addition instead of iteratively calling ``dbnodes.append(...)``.
        """
        from sqlalchemy.dialects.postgresql import insert
        from sqlalchemy.exc import IntegrityError

        super().add_nodes(nodes)
        skip_orm = kwargs.get('skip_orm', False)
        skip_flush = kwargs.get('skip_flush', False)
        use_extend = kwargs.get('use_extend', False)

        def check_node(given_node):
            """Check if the given node is of the correct type and stored."""
            if not isinstance(given_node, self.NODE_CLASS):
                raise TypeError(f'invalid type {type(given_node)}, has to be {self.NODE_CLASS}')

            if not given_node.is_stored:
                raise ValueError('At least one of the provided nodes is unstored, stopping...')

        with utils.disable_expire_on_commit(self.backend.get_session()) as session:
            if not skip_orm and not use_extend:
                # Get dbnodes here ONCE, otherwise each call to dbnodes will re-read the current value in the database
                dbnodes = self.model.dbnodes

                for node in nodes:
                    check_node(node)

                    # Use pattern as suggested here:
                    # http://docs.sqlalchemy.org/en/latest/orm/session_transaction.html#using-savepoint
                    try:
                        with session.begin_nested():
                            dbnodes.append(node.bare_model)
                            # This branch measures how much overhead frequent flushes add
                            if not skip_flush:
                                session.flush()
                    except IntegrityError:
                        # Duplicate entry, skip
                        pass

            # This branch checks whether using extend instead of append is more efficient
            elif not skip_orm:
                dbnodes = self.model.dbnodes
                for node in nodes:
                    check_node(node)
                try:
                    with session.begin_nested():
                        dbnodes.extend([node.bare_model for node in nodes])
                except IntegrityError:
                    pass

            else:
                ins_dict = []
                for node in nodes:
                    check_node(node)
                    ins_dict.append({'dbnode_id': node.id, 'dbgroup_id': self.id})

                table = self.GROUP_NODE_CLASS.__table__
                ins = insert(table).values(ins_dict)
                session.execute(ins.on_conflict_do_nothing(index_elements=['dbnode_id', 'dbgroup_id']))

            # Commit everything as up till now we've just flushed
            if not session.in_nested_transaction():
                session.commit()

And the benchmark script is shown here:

import uuid

import pytest

from aiida import orm


@pytest.mark.benchmark(group='Group')
@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.parametrize('skip_orm', [True, False], ids=['skip_orm', 'with_orm'])
@pytest.mark.parametrize('skip_flush', [True, False], ids=['skip_flush', 'with_flush'])
@pytest.mark.parametrize('use_extend', [True, False], ids=['extend', 'append'])
def test_group_add_nodes(benchmark, skip_orm, skip_flush, use_extend):
    num_nodes = 100
    nodes = []
    for _ in range(num_nodes):
        node = orm.Data().store()
        nodes.append(node.backend_entity)

    def _setup():
        id = uuid.uuid4()
        group = orm.Group(label=id.hex).store().backend_entity
        return (group, nodes), {}

    def _run(group, nodes):
        group.add_nodes(nodes, skip_orm=skip_orm, skip_flush=skip_flush, use_extend=use_extend)

    benchmark.pedantic(_run, setup=_setup, rounds=10)

The result of the benchmark is:

----------------------------------------------------------------------------------------------- benchmark 'Group': 8 tests -----------------------------------------------------------------------------------------------
Name (time in ms)                                         Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_group_add_nodes[extend-with_flush-skip_orm]      39.0170 (1.0)       43.7703 (1.00)      41.6220 (1.00)      1.6271 (1.44)      41.9490 (1.02)      2.2244 (1.29)          4;0  24.0258 (1.00)         10           1
test_group_add_nodes[append-skip_flush-skip_orm]      39.6299 (1.02)      46.7606 (1.07)      41.6094 (1.0)       2.1486 (1.91)      41.0062 (1.0)       2.4379 (1.41)          1;1  24.0330 (1.0)          10           1
test_group_add_nodes[extend-skip_flush-with_orm]      39.7142 (1.02)      49.6585 (1.14)      44.3849 (1.07)      3.0640 (2.72)      44.1206 (1.08)      3.4940 (2.03)          3;0  22.5302 (0.94)         10           1
test_group_add_nodes[extend-skip_flush-skip_orm]      39.9105 (1.02)      51.6310 (1.18)      44.7775 (1.08)      3.8493 (3.42)      44.0110 (1.07)      5.0564 (2.93)          4;0  22.3326 (0.93)         10           1
test_group_add_nodes[append-with_flush-skip_orm]      40.0179 (1.03)      43.5916 (1.0)       41.6522 (1.00)      1.1261 (1.0)       41.4639 (1.01)      1.7236 (1.0)           3;0  24.0084 (1.00)         10           1
test_group_add_nodes[extend-with_flush-with_orm]      41.4839 (1.06)      49.8534 (1.14)      44.5094 (1.07)      2.4542 (2.18)      44.5201 (1.09)      2.8332 (1.64)          3;1  22.4671 (0.93)         10           1
test_group_add_nodes[append-skip_flush-with_orm]     141.6221 (3.63)     162.0362 (3.72)     152.1047 (3.66)      5.3543 (4.75)     152.2249 (3.71)      5.7713 (3.35)          2;0   6.5744 (0.27)         10           1
test_group_add_nodes[append-with_flush-with_orm]     151.8634 (3.89)     280.4761 (6.43)     179.1391 (4.31)     39.7534 (35.30)    162.6232 (3.97)     24.5717 (14.26)         1;1   5.5823 (0.23)         10           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The benchmarks confirm that the main bottleneck is appending nodes one by one in the ORM implementation, causing excessive database interactions. Frequent session.flush() calls add slight overhead but are not the primary issue. Using extend for batch operations significantly improves performance in ORM mode.
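The append-vs-extend difference can also be observed directly by counting emitted SQL statements. The sketch below is not AiiDA code — it uses a hypothetical two-table schema that mirrors the group-node relation, with an event hook counting cursor executions; per-node append + flush issues one association INSERT per node, while extend batches them into a single executemany flush:

```python
# Sketch (hypothetical mini-schema): count SQL statements emitted by per-node
# append + flush versus a single extend + flush, via an engine event hook.
from sqlalchemy import Column, ForeignKey, Integer, Table, create_engine, event
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

# Association table mirroring the shape of the group-node relation table.
group_nodes = Table(
    'group_nodes',
    Base.metadata,
    Column('group_id', ForeignKey('groups.id'), primary_key=True),
    Column('node_id', ForeignKey('nodes.id'), primary_key=True),
)


class Node(Base):
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)


class Group(Base):
    __tablename__ = 'groups'
    id = Column(Integer, primary_key=True)
    nodes = relationship(Node, secondary=group_nodes)


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

counter = {'n': 0}


@event.listens_for(engine, 'after_cursor_execute')
def _count(*args, **kwargs):
    counter['n'] += 1


with Session(engine) as session:
    nodes = [Node() for _ in range(10)]
    g1, g2 = Group(), Group()
    session.add_all(nodes + [g1, g2])
    session.commit()

    start = counter['n']
    for node in nodes:  # one association INSERT (and flush) per node
        g1.nodes.append(node)
        session.flush()
    append_stmts = counter['n'] - start

    start = counter['n']
    g2.nodes.extend(nodes)  # one batched ("executemany") INSERT on flush
    session.flush()
    extend_stmts = counter['n'] - start

print(append_stmts > extend_stmts)  # True
```

The exact counts include a couple of lazy-load SELECTs, but the append path scales with the number of nodes while the extend path stays essentially constant, which matches the benchmark above.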

@rabbull
Contributor

rabbull commented Jan 21, 2025

I understand that append-and-flush is used because it wants to mimic the behavior of on_conflict_do_nothing, i.e., to avoid inserting duplicate nodes into the relation table. If that is the case, I would say making skip_orm the default implementation and removing the flag looks good. Over the years no user has run into bugs related to this, and the implementation itself also looks promising to me.
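To make that equivalence concrete, here is a small self-contained sketch of the ON CONFLICT DO NOTHING semantics the skip_orm branch relies on. It uses SQLAlchemy's sqlite insert dialect (the real code uses the postgresql one, which exposes the same on_conflict_do_nothing method) and a hypothetical schema mirroring the group-node relation table:

```python
# Sketch of ON CONFLICT DO NOTHING, the mechanism the skip_orm branch uses
# instead of per-row savepoint + flush + IntegrityError handling.
# SQLite dialect stands in for PostgreSQL; the schema is hypothetical.
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, func, select
from sqlalchemy.dialects.sqlite import insert

engine = create_engine('sqlite://')
metadata = MetaData()
group_nodes = Table(
    'group_nodes',
    metadata,
    Column('dbgroup_id', Integer, primary_key=True),
    Column('dbnode_id', Integer, primary_key=True),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(group_nodes).values([{'dbgroup_id': 1, 'dbnode_id': n} for n in (1, 2, 3)]))
    # Re-insert an overlapping batch: duplicates are silently skipped instead
    # of raising IntegrityError, so no per-row savepoint/flush dance is needed.
    stmt = insert(group_nodes).values([{'dbgroup_id': 1, 'dbnode_id': n} for n in (2, 3, 4)])
    conn.execute(stmt.on_conflict_do_nothing(index_elements=['dbgroup_id', 'dbnode_id']))
    count = conn.execute(select(func.count()).select_from(group_nodes)).scalar()

print(count)  # 4: rows (1,1)..(1,3) plus only the new (1,4)
```

The whole overlapping batch goes through in a single statement, which is exactly the duplicate-handling behavior the ORM branch reproduces row by row.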
