
Use/remove SqlaGroup.add_nodes/remove_nodes's skip_orm flag? #5453

Open
chrisjsewell opened this issue Mar 16, 2022 · 7 comments · May be fixed by #6720
Labels: topic/orm, topic/storage, type/performance (Issue related to how quickly AiiDA works)

@chrisjsewell
Member

chrisjsewell commented Mar 16, 2022

The skip_orm flag was added to these methods in b40b2ca and bced84e

However, it is not currently used anywhere in the code base:

  • For add_nodes, it was effectively superseded by StorageBackend.bulk_insert, for importing archives
  • For remove_nodes, I'm not sure if it was ever used (@sphuber?)

One place it might be used is in aiida/cmdline/commands/cmd_group.py::group_remove_nodes.
I would note, though, that the performance of bulk insertions/deletions on the session has been improved in sqlalchemy>=1.4, so it might be worth double-checking any perceived performance improvements.
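For context, what a skip_orm-style path does is bypass the ORM unit of work and issue a single Core INSERT against the association table. Below is a minimal, self-contained sketch of that pattern; it is not AiiDA code — the table and column names merely mimic the group-node relationship table, and an in-memory SQLite database stands in for PostgreSQL:

```python
# Illustrative sketch only: a skip_orm-style bulk insert with SQLAlchemy Core.
# The table/column names are hypothetical, mirroring the shape of the
# group-node relationship table; SQLite stands in for PostgreSQL.
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, func, insert, select

engine = create_engine('sqlite://')
metadata = MetaData()
group_nodes = Table(
    'group_nodes',
    metadata,
    Column('dbgroup_id', Integer, primary_key=True),
    Column('dbnode_id', Integer, primary_key=True),
)
metadata.create_all(engine)

rows = [{'dbgroup_id': 1, 'dbnode_id': n} for n in range(1000)]
with engine.begin() as conn:
    # One INSERT statement for all 1000 rows, instead of 1000 ORM appends,
    # each followed by a flush.
    conn.execute(insert(group_nodes), rows)
    count = conn.execute(select(func.count()).select_from(group_nodes)).scalar()

print(count)  # 1000
```

The single `executemany`-style `conn.execute(insert(...), rows)` call is what saves the per-row round trips that the ORM append loop incurs.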

@chrisjsewell chrisjsewell added topic/orm type/performance Issue related to how quickly AiiDA works labels Mar 16, 2022
@sphuber
Contributor

sphuber commented Mar 21, 2022

We added this back in the day because the performance of adding nodes to groups was really bad. The raw SQL will often take order(s) of magnitude less time compared to the call through our ORM. With the skip_orm flag this difference was significantly reduced, but we weren't sure it would be bug-free, so we decided not to expose it in the front-end ORM nor use it ourselves by default. I have used it in my scripts for the SDB and 3DD databases, which often need to add tens of thousands of nodes to groups; without this flag that would take ages. I can run a quick test on v2.0, but I remember recently adding nodes without it and still being surprised by how long it took.

@sphuber
Contributor

sphuber commented Mar 21, 2022

Yep, still a problem, skip_orm=True is 50 times faster.

In [1]: group = Group('test')

In [2]: group.store()
Out[2]: <Group: 'test' [type core], of user mail@sphuber.net>

In [3]: nodes = [Data().store() for _ in range(1000)]

In [4]: backend_nodes = [n.backend_entity for n in nodes]

In [5]: %timeit group.add_nodes(nodes)
1.94 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: group.clear()

In [7]: %timeit group.backend_entity.add_nodes(backend_nodes)
2.01 s ± 49.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: group.clear()

In [9]: %timeit group.backend_entity.add_nodes(backend_nodes, skip_orm=True)
37.9 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@chrisjsewell
Member Author

Ok cheers. I think there may be a better way to do this, though, in order to expose it on the front end. Will look into it.

@GeigerJ2
Contributor

We'll have a talented Master's student who'll be looking into improving bulk SQL operations, such as what is outlined here. Have there, by any chance, been any developments on this topic in recent years?

@sphuber
Contributor

sphuber commented Oct 24, 2024

Not really; I think the state of this conversation is still pretty accurate.

@rabbull
Contributor

rabbull commented Jan 21, 2025

TL;DR: Appending the nodes one by one in SqlaGroup.add_nodes is the main reason the ORM mode is less efficient. Flushing every time a node is appended also contributes a little to the performance problem, but is acceptable.

I analyzed the implementation of the add_nodes method of Group. The reason the ORM approach is far less efficient than the core-API path seems to be that the ORM implementation appends the incoming nodes to dbnodes one by one, which can lead to len(nodes)-1 extra database interactions compared to the core API implementation. I also suspect the repeated session.flush() calls contribute to the problem.

To confirm these hypotheses, I modified the function slightly and performed a benchmark.

Two optional parameters are added to SqlaGroup.add_nodes:

  • skip_flush: Controls whether session.flush() is called after each node addition, to measure the overhead of frequent flushes.
  • use_extend: Enables using dbnodes.extend(...) for batch addition instead of iteratively calling dbnodes.append(...).

The modified method can be found here for reference:

    def add_nodes(self, nodes, **kwargs):
        """Add a node or a set of nodes to the group.

        :note: all the nodes *and* the group itself have to be stored.

        :param nodes: a list of `BackendNode` instances to be added to this group

        :param kwargs:
            skip_orm: When the flag is on, the SQLA ORM is skipped and SQLA is used
                to create a direct SQL INSERT statement to the group-node relationship
                table (to improve speed).
            skip_flush: When the flag is on, ``session.flush()`` is not called after
                each node addition, to measure the overhead of frequent flushes.
            use_extend: When the flag is on, ``dbnodes.extend(...)`` is used for batch
                addition instead of iteratively calling ``dbnodes.append(...)``.
        """
        from sqlalchemy.dialects.postgresql import insert
        from sqlalchemy.exc import IntegrityError

        super().add_nodes(nodes)
        skip_orm = kwargs.get('skip_orm', False)
        skip_flush = kwargs.get('skip_flush', False)
        use_extend = kwargs.get('use_extend', False)

        def check_node(given_node):
            """Check if the given node is of the correct type and stored."""
            if not isinstance(given_node, self.NODE_CLASS):
                raise TypeError(f'invalid type {type(given_node)}, has to be {self.NODE_CLASS}')

            if not given_node.is_stored:
                raise ValueError('At least one of the provided nodes is unstored, stopping...')

        with utils.disable_expire_on_commit(self.backend.get_session()) as session:
            if not skip_orm and not use_extend:
                # Get dbnodes here ONCE, otherwise each call to dbnodes will re-read the current value in the database
                dbnodes = self.model.dbnodes

                for node in nodes:
                    check_node(node)

                    # Use pattern as suggested here:
                    # http://docs.sqlalchemy.org/en/latest/orm/session_transaction.html#using-savepoint
                    try:
                        with session.begin_nested():
                            dbnodes.append(node.bare_model)
                            # This branch measures how much overhead frequent flushes add
                            if not skip_flush:
                                session.flush()
                    except IntegrityError:
                        # Duplicate entry, skip
                        pass

            # This branch checks whether using extend instead of append is more efficient
            elif not skip_orm:
                dbnodes = self.model.dbnodes
                for node in nodes:
                    check_node(node)
                try:
                    with session.begin_nested():
                        dbnodes.extend([node.bare_model for node in nodes])
                except IntegrityError:
                    pass

            else:
                ins_dict = []
                for node in nodes:
                    check_node(node)
                    ins_dict.append({'dbnode_id': node.id, 'dbgroup_id': self.id})

                table = self.GROUP_NODE_CLASS.__table__
                ins = insert(table).values(ins_dict)
                session.execute(ins.on_conflict_do_nothing(index_elements=['dbnode_id', 'dbgroup_id']))

            # Commit everything as up till now we've just flushed
            if not session.in_nested_transaction():
                session.commit()

And the benchmark script is shown here:

import uuid

import pytest

from aiida import orm


@pytest.mark.benchmark(group='Group')
@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.parametrize('skip_orm', [True, False], ids=['skip_orm', 'with_orm'])
@pytest.mark.parametrize('skip_flush', [True, False], ids=['skip_flush', 'with_flush'])
@pytest.mark.parametrize('use_extend', [True, False], ids=['extend', 'append'])
def test_group_add_nodes(benchmark, skip_orm, skip_flush, use_extend):
    num_nodes = 100
    nodes = []
    for _ in range(num_nodes):
        node = orm.Data().store()
        nodes.append(node.backend_entity)

    def _setup():
        id = uuid.uuid4()
        group = orm.Group(label=id.hex).store().backend_entity
        return (group, nodes), {}

    def _run(group, nodes):
        group.add_nodes(nodes, skip_orm=skip_orm, skip_flush=skip_flush, use_extend=use_extend)

    benchmark.pedantic(_run, setup=_setup, rounds=10)

The result of the benchmark is:

----------------------------------------------------------------------------------------------- benchmark 'Group': 8 tests -----------------------------------------------------------------------------------------------
Name (time in ms)                                         Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_group_add_nodes[extend-with_flush-skip_orm]      39.0170 (1.0)       43.7703 (1.00)      41.6220 (1.00)      1.6271 (1.44)      41.9490 (1.02)      2.2244 (1.29)          4;0  24.0258 (1.00)         10           1
test_group_add_nodes[append-skip_flush-skip_orm]      39.6299 (1.02)      46.7606 (1.07)      41.6094 (1.0)       2.1486 (1.91)      41.0062 (1.0)       2.4379 (1.41)          1;1  24.0330 (1.0)          10           1
test_group_add_nodes[extend-skip_flush-with_orm]      39.7142 (1.02)      49.6585 (1.14)      44.3849 (1.07)      3.0640 (2.72)      44.1206 (1.08)      3.4940 (2.03)          3;0  22.5302 (0.94)         10           1
test_group_add_nodes[extend-skip_flush-skip_orm]      39.9105 (1.02)      51.6310 (1.18)      44.7775 (1.08)      3.8493 (3.42)      44.0110 (1.07)      5.0564 (2.93)          4;0  22.3326 (0.93)         10           1
test_group_add_nodes[append-with_flush-skip_orm]      40.0179 (1.03)      43.5916 (1.0)       41.6522 (1.00)      1.1261 (1.0)       41.4639 (1.01)      1.7236 (1.0)           3;0  24.0084 (1.00)         10           1
test_group_add_nodes[extend-with_flush-with_orm]      41.4839 (1.06)      49.8534 (1.14)      44.5094 (1.07)      2.4542 (2.18)      44.5201 (1.09)      2.8332 (1.64)          3;1  22.4671 (0.93)         10           1
test_group_add_nodes[append-skip_flush-with_orm]     141.6221 (3.63)     162.0362 (3.72)     152.1047 (3.66)      5.3543 (4.75)     152.2249 (3.71)      5.7713 (3.35)          2;0   6.5744 (0.27)         10           1
test_group_add_nodes[append-with_flush-with_orm]     151.8634 (3.89)     280.4761 (6.43)     179.1391 (4.31)     39.7534 (35.30)    162.6232 (3.97)     24.5717 (14.26)         1;1   5.5823 (0.23)         10           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The benchmarks confirm that the main bottleneck is appending nodes one by one in the ORM implementation, causing excessive database interactions. Frequent session.flush() calls add slight overhead but are not the primary issue. Using extend for batch operations significantly improves performance in ORM mode.
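The append-vs-extend difference can also be observed directly by counting emitted SQL statements. The sketch below is not AiiDA code — it uses a hypothetical two-table schema that mirrors the group-node relation, with an event hook counting cursor executions; per-node append + flush issues one association INSERT per node, while extend batches them into a single executemany flush:

```python
# Sketch (hypothetical mini-schema): count SQL statements emitted by per-node
# append + flush versus a single extend + flush, via an engine event hook.
from sqlalchemy import Column, ForeignKey, Integer, Table, create_engine, event
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

# Association table mirroring the shape of the group-node relation table.
group_nodes = Table(
    'group_nodes',
    Base.metadata,
    Column('group_id', ForeignKey('groups.id'), primary_key=True),
    Column('node_id', ForeignKey('nodes.id'), primary_key=True),
)


class Node(Base):
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)


class Group(Base):
    __tablename__ = 'groups'
    id = Column(Integer, primary_key=True)
    nodes = relationship(Node, secondary=group_nodes)


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

counter = {'n': 0}


@event.listens_for(engine, 'after_cursor_execute')
def _count(*args, **kwargs):
    counter['n'] += 1


with Session(engine) as session:
    nodes = [Node() for _ in range(10)]
    g1, g2 = Group(), Group()
    session.add_all(nodes + [g1, g2])
    session.commit()

    start = counter['n']
    for node in nodes:  # one association INSERT (and flush) per node
        g1.nodes.append(node)
        session.flush()
    append_stmts = counter['n'] - start

    start = counter['n']
    g2.nodes.extend(nodes)  # one batched ("executemany") INSERT on flush
    session.flush()
    extend_stmts = counter['n'] - start

print(append_stmts > extend_stmts)  # True
```

The exact counts include a couple of lazy-load SELECTs, but the append path scales with the number of nodes while the extend path stays essentially constant, which matches the benchmark above.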

@rabbull
Contributor

rabbull commented Jan 21, 2025

I understand that append-and-flush is used because it wants to mimic the behavior of on_conflict_do_nothing, i.e., to avoid inserting duplicate nodes into the relation table. If that is the case, I would say making skip_orm the default implementation and removing the flag looks good. Over the years no user has run into bugs related to this, and the implementation itself also looks promising to me.
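To make that equivalence concrete, here is a small self-contained sketch of the ON CONFLICT DO NOTHING semantics the skip_orm branch relies on. It uses SQLAlchemy's sqlite insert dialect (the real code uses the postgresql one, which exposes the same on_conflict_do_nothing method) and a hypothetical schema mirroring the group-node relation table:

```python
# Sketch of ON CONFLICT DO NOTHING, the mechanism the skip_orm branch uses
# instead of per-row savepoint + flush + IntegrityError handling.
# SQLite dialect stands in for PostgreSQL; the schema is hypothetical.
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, func, select
from sqlalchemy.dialects.sqlite import insert

engine = create_engine('sqlite://')
metadata = MetaData()
group_nodes = Table(
    'group_nodes',
    metadata,
    Column('dbgroup_id', Integer, primary_key=True),
    Column('dbnode_id', Integer, primary_key=True),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(group_nodes).values([{'dbgroup_id': 1, 'dbnode_id': n} for n in (1, 2, 3)]))
    # Re-insert an overlapping batch: duplicates are silently skipped instead
    # of raising IntegrityError, so no per-row savepoint/flush dance is needed.
    stmt = insert(group_nodes).values([{'dbgroup_id': 1, 'dbnode_id': n} for n in (2, 3, 4)])
    conn.execute(stmt.on_conflict_do_nothing(index_elements=['dbgroup_id', 'dbnode_id']))
    count = conn.execute(select(func.count()).select_from(group_nodes)).scalar()

print(count)  # 4: rows (1,1)..(1,3) plus only the new (1,4)
```

The whole overlapping batch goes through in a single statement, which is exactly the duplicate-handling behavior the ORM branch reproduces row by row.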
