_source field goes missing from Elassandra after nodetool rebuild_index #347

pankajydv · 2020-06-11T12:18:14Z

Elassandra version:
elassandra-6.8.4.3

Plugins installed: []

JVM version (java -version):
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux CMSNextDB3871 3.10.0-229.14.1.el7.x86_64 #1 SMP Tue Sep 15 15:05:51 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
I have 4 Cassandra datacenters and recently migrated them to Elassandra. After running nodetool rebuild_index, I see lots of documents in Elassandra that have no corresponding record in Cassandra. None of these documents has a _source field.

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make for
us to reproduce it, the more likely that somebody will take the time to look at it.

Migrated an existing Apache Cassandra v3.11.2.0 cluster with around 40,000,000 documents, following the steps in the Elassandra documentation.
Created a mapping for the fields to be indexed in Elasticsearch.
After 3-4 runs of nodetool rebuild_index --thread 16, got lots of documents without a _source field; they are missing all fields except the Elasticsearch-specific ones such as _id, _type and _index.

Please provide the following information:

elassandra logs (logs/system.logs or /var/lib/cassandra/system.log)
elasticsearch cluster state (curl http://localhost:9200/_cluster/state)
cassandra schema (cqlsh>DESC KEYSPACE <your_keyspace>)
cassandra gossip state (run: nodetool gossipinfo)

system.log
cluster_status.log
gossipinfo.log
keyspace.log

pankajydv · 2020-06-11T12:20:12Z

Please note the issue looks similar to #244, but it doesn't have a resolution.

pankajydv · 2020-06-11T12:25:17Z

Please also note that the count returned by http://localhost:9200/cmsentitydb/_count differs significantly across the 4 datacenters: 43406958, 43458440, 43451846 and 35910790.

vroyer · 2020-06-11T14:30:08Z

Such a situation usually happens when a row has expired at the Cassandra level but was indexed before it expired. For results with an empty _source, please check that the underlying row exists by issuing a SELECT * FROM table WHERE PK = _id.
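
A minimal sketch of that check, not taken from the thread: it picks out the hits that come back without a _source from a search response and pairs each _id with a parameterised SELECT. The function names, the table name and the primary-key column are hypothetical, and it assumes a single-column primary key whose value equals the document _id.

```python
from typing import Any, Dict, List, Tuple


def ids_without_source(search_response: Dict[str, Any]) -> List[str]:
    """Return the _id of every hit whose _source is missing or empty."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits if not hit.get("_source")]


def lookup_statements(
    ids: List[str], table: str, pk_column: str
) -> List[Tuple[str, Tuple[str]]]:
    """Pair a parameterised CQL SELECT with each id.

    `table` and `pk_column` are placeholders for your own schema.
    """
    query = f"SELECT * FROM {table} WHERE {pk_column} = %s"
    return [(query, (doc_id,)) for doc_id in ids]


# Hand-written sample in the shape of an Elasticsearch _search response.
sample_response = {
    "hits": {
        "hits": [
            {"_index": "cmsentitydb", "_type": "entity", "_id": "1",
             "_source": {"title": "kept"}},
            {"_index": "cmsentitydb", "_type": "entity", "_id": "2"},
            {"_index": "cmsentitydb", "_type": "entity", "_id": "3",
             "_source": {}},
        ]
    }
}

suspect_ids = ids_without_source(sample_response)
statements = lookup_statements(suspect_ids, "my_keyspace.my_table", "id")

for query, params in statements:
    print(query, params)
```

Each (query, params) pair can then be run through cqlsh or a driver; an id that returns no row corresponds to a stale document in the index.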

pankajydv · 2020-06-11T14:50:47Z

@vroyer No, the record doesn't exist in the underlying Cassandra table. That's the real issue: we are getting wrong results from the Elassandra index, and there are just too many such records in it. Is there a way to get rid of all such documents from Elassandra?

vroyer · 2020-06-11T15:28:31Z

In that situation, you should delete the index and re-create it so that only existing rows are indexed, or (second scenario) create a new index and switch to it using an ES index alias.
(An index rebuild does not delete documents; it just reindexes rows from the SSTables on disk.)

Just keep in mind that Cassandra triggers a single-threaded index build when the first index is created. So, in the first scenario, if you want to rebuild quickly, you'll need to kill the single-threaded index build on each node (nodetool compactionstats + nodetool stop --compaction_id xxxx) and relaunch a nodetool rebuild_index --threads 16 .... And in the second scenario, you'll need to launch the index rebuild...
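
For the second scenario, a rough sketch of the alias switch: it builds the request body for the Elasticsearch _aliases endpoint, which applies the remove and add actions in one atomic call. The index and alias names below are hypothetical, and it assumes clients already query through the alias rather than the index name.

```python
import json
from typing import Any, Dict


def alias_swap_body(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Build a POST /_aliases body that repoints `alias` from old to new."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: replace them with your own alias and indices.
body = alias_swap_body("cmsentitydb_live", "cmsentitydb_v1", "cmsentitydb_v2")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent with curl -XPOST http://localhost:9200/_aliases -H 'Content-Type: application/json' -d '...' once the new index has finished building.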

pankajydv · 2020-06-11T16:10:19Z

@vroyer - Thanks for the quick response. The first approach is not an option for us because the index is already being used in production. We'll go for the second approach.

But both of these approaches are time-consuming and don't resolve the issue quickly in a production environment. It would be great if Elassandra could keep itself in sync with Cassandra deletes, so that we don't face such issues in the live environment.

vroyer · 2020-06-11T21:28:39Z

The missing rows were probably removed by previous compactions. You can enable re-index on compaction to get the behaviour you expect, but it significantly increases the cost of compaction, and it's too late right now!

pankajydv · 2020-06-12T16:35:37Z

How can I achieve this ('enable re-index on compaction')? Any references would help.
