Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestBPReorderingMergePolicy fails CheckIndex.testHnswGraph #14127

Open
msokolov opened this issue Jan 9, 2025 · 7 comments
Open

TestBPReorderingMergePolicy fails CheckIndex.testHnswGraph #14127

msokolov opened this issue Jan 9, 2025 · 7 comments

Comments

@msokolov
Copy link
Contributor

msokolov commented Jan 9, 2025

Description

TestBPReorderingMergePolicy > testReorderOnAddIndexes FAILED
    org.apache.lucene.index.CheckIndex$CheckIndexException: Field "vector" has repeated neighbors of node 6978 with value 7004
        at __randomizedtesting.SeedInfo.seed([B1FA21A9AAC314F2:A0D1EE47A3A0C942]:0)
        at app/[email protected]/org.apache.lucene.index.CheckIndex.testHnswGraph(CheckIndex.java:2931)
        at app/[email protected]/org.apache.lucene.index.CheckIndex.testHnswGraphs(CheckIndex.java:2806)
        at app/[email protected]/org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1125)
        at app/[email protected]/org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:822)
        at app/[email protected]/org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:592)
        at app/[email protected]/org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:338)
        at app/[email protected]/org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:921)
        at app/[email protected]/org.apache.lucene.util.IOUtils.close(IOUtils.java:85)
        at app/[email protected]/org.apache.lucene.util.IOUtils.close(IOUtils.java:72)
        at app//org.apache.lucene.misc.index.TestBPReorderingMergePolicy.testReorderOnAddIndexes(TestBPReorderingMergePolicy.java:241)

B1FA21A9AAC314F2

Gradle command to reproduce

./gradlew :lucene:misc:test --tests "org.apache.lucene.misc.index.TestBPReorderingMergePolicy" -Dtests.seed=B1FA21A9AAC314F2

@msokolov
Copy link
Contributor Author

msokolov commented Jan 9, 2025

Disabling HnswGraphBuilder.connectComponents makes the test case pass. I suspect that connectComponents is not careful about adding duplicates when making its new links bidirectional? It's benign, but the new CheckIndex check doesn't like it. We should probably (1) make CheckIndex more lenient, (2) fix connectComponents() so it doesn't create duplicates, and (3) re-enable the check, with an index version gate (since there may be indexes in the wild with duplicates now)

  void finish() throws IOException {
    // System.out.println("finish " + frozen);
    // connectComponents();
    frozen = true;
  }

@iverase
Copy link
Contributor

iverase commented Jan 20, 2025

In case useful, I could reduce the test to something like:

  public void testFastCommits() throws IOException {
    IndexWriterConfig config = newIndexWriterConfig().setMergePolicy(newTieredMergePolicy());
    try (Directory dir1 = newDirectory(); IndexWriter w1 = new IndexWriter(dir1, config)) {
      Document doc = new Document();
      KnnFloatVectorField vectorField = new KnnFloatVectorField("vector", new float[]{0});
      doc.add(vectorField);
      for (int i = 0; i < 1000; ++i) {
        int intValue = i % 2 == 0 ? 0 : i % 10;
        vectorField.setVectorValue(new float[]{intValue});
        w1.addDocument(doc);
        if (random().nextInt(4) == 0) {
          w1.commit();
        }
      }
    }
  }

This test fails ~5% of the times for me locally.

@msokolov
Copy link
Contributor Author

I spent some time trying to understand how this arose, and working on a fix, and I believe that the BP reordering exposed a pre-existing behavior in the component-merging code which could create these duplicates. Nothing really prevents it, but it didn't happen before, I'm not completely sure why, but my best theory is that because the way graphs are created we always add docs in docid order, but this is not true when reordering. I looked in to how to prevent the duplicates, and one thing we could do is to remove them when writing the graph in the codec (in Lucene99HnswVectorsWriter.writeGraph). This is a good place to do it because we sort the nodes there. Doing this in HnswGraphBuilder also is possible, but I think it would be less efficient because the neighbor nodes aren't sorted and any given node's neighbors might need to be checked multiple times.

@msokolov
Copy link
Contributor Author

Thanks for the test case @iverase, but I was able to reliably repro using the existing test case. These tests are pretty slow so I'd just as soon not add too many more, unless you think there is some new thing tested by that?

@iverase
Copy link
Contributor

iverase commented Jan 21, 2025

Thanks for the test case @iverase, but I was able to reliably repro using the existing test case. These tests are pretty slow so I'd just as soon not add too many more, unless you think there is some new thing tested by that?

I just wanted to prove that the issue is not related with the new stuff you added and can be reproduced with a very simple test. No need to commit it. It helped me to reproduce it with a lower number of nodes (I could reproduce it with just 96 documents) so I could study it better.

and I had another look and these are my observations:

In this line we create a list of components. If I understand correctly a component contains a list of nodes that are connected between them and it provides and entry node and the number of nodes it contains. I assume that two components are disconnected.

A bit below we tried to connect two components by making a search of the closest nodes from one component (c0) to the start of the other component (c1).

In theory that search should return two nodes from c0 that do not exist in c1. When it fails, it easy to prove that this search is returning neighbours from c1 when it shouldn't. Is that behaviour right?

@msokolov
Copy link
Contributor Author

The thing that complicates this is that the graph is directed - that is its links are not reciprocal always (although they mostly are), Therefore although it is true by construction of the components that a search of C0 will not find nodes in C1, the reverse is not necessarily true. Nodes in C0 may be reachable from nodes in C1.

@iverase
Copy link
Contributor

iverase commented Jan 21, 2025

One possible fix would be to remove neighbours from c1 from the search result which implies an extra graph seek to collect the neighbours into the visited nodes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants