Partition map error / Replica partition nodes not defined #329

Open
gmaz42 opened this issue Dec 28, 2020 · 20 comments
@gmaz42

gmaz42 commented Dec 28, 2020

Hi there,

Today I noticed this error in the logs (it must be logged via Printf or Infof), which looks quite worrisome:

Partition map error: Master partition nodes not defined for namespace `my-namespace`: 3280 out of 4096
Replica partition nodes not defined for namespace `my-namespace`: 6526 out of 4096
Partition map errors normally occur when the cluster has partitioned due to network anomaly or node crash, or is not configured properly. Refer to https://www.aerospike.com/docs/operations/configure for more information.
.
  1. Any idea why it is not logged or returned as an error? Is the cluster currently in a perilous state?
  2. How can I investigate this problem further?

The mentioned URL does not provide help to troubleshoot this specific issue and I could not find online similar reports.

@khaf
Collaborator

khaf commented Dec 29, 2020

This is the built-in validation code that runs when a partition map is rebuilt after a cluster event (node up/down, partitioning, etc.). It means that the partition table was not fully populated after the tend, i.e. the client didn't catch up with the full change on that tend.
By itself, this should not be a problem, unless it happens over a continuous period of time.
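
To make the numbers in the log message concrete, here is a minimal sketch of what a validation pass of this kind could look like. It is not the client's actual code: `node`, `replicaSets`, and `countUndefined` are hypothetical names introduced for illustration. The only facts taken from the thread are the 4096 partitions per namespace and the idea that a slot with no node assigned counts as "not defined". The layout (one master array plus one array per extra replica) is an assumption; under it, a replica count above 4096 such as "6526 out of 4096" would be possible with replication-factor 3.

```go
package main

import "fmt"

// partitionsPerNamespace is the partition count quoted in the log message.
const partitionsPerNamespace = 4096

// node is a hypothetical stand-in for a cluster node handle.
type node struct{ name string }

// replicaSets is a hypothetical layout: index 0 holds the master owner of
// each partition, the remaining indexes hold the replica owners.
type replicaSets [][]*node

// countUndefined returns how many master and replica slots have no node assigned.
func countUndefined(rs replicaSets) (master, replica int) {
	for i, owners := range rs {
		for _, n := range owners {
			if n != nil {
				continue
			}
			if i == 0 {
				master++
			} else {
				replica++
			}
		}
	}
	return master, replica
}

func main() {
	a := &node{name: "A"}
	// Replication factor 3: one master array plus two replica arrays.
	rs := make(replicaSets, 3)
	for i := range rs {
		rs[i] = make([]*node, partitionsPerNamespace)
	}
	// Pretend a single tend only filled in the first 1000 master slots.
	for p := 0; p < 1000; p++ {
		rs[0][p] = a
	}
	m, r := countUndefined(rs)
	fmt.Printf("master undefined: %d out of %d\n", m, partitionsPerNamespace)
	fmt.Printf("replica undefined: %d out of %d\n", r, partitionsPerNamespace)
}
```

A partially filled map right after a tend shows up as non-zero counts here; the counts should drop to zero once later tends fill in the remaining slots.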

@khaf khaf added the question label Dec 29, 2020
@gmaz42
Author

gmaz42 commented Dec 30, 2020

Thanks for your reply @khaf; I can reliably reproduce this issue every time the Go application starts.

I cannot find a corresponding metric in https://www.aerospike.com/docs/reference/metrics/ and the error message seems to indicate there might be a problem on the server side. Do you know how to identify the issue on the server side?

@khaf
Collaborator

khaf commented Dec 30, 2020

This is unlikely to be a server issue, rather a client issue. Could you share some information regarding your server config, number of nodes, client policy and how you connect to the server? How many log lines correspond to this message? Is it a one off every time you start the client, or is it recurrent?

@gmaz42
Author

gmaz42 commented Jan 4, 2021

I see it every time the service is started. Let me share more details about the client/server configuration later.

@khaf
Collaborator

khaf commented Jan 4, 2021

Thanks, looking forward.

@gmaz42
Author

gmaz42 commented Jan 5, 2021

The Aerospike server in use is the one provided by the current Aerospike AMI on AWS. The asinfo output and configuration follow:

asinfo 
1 :  node
     BB90DA2E694520B
2 :  statistics
     cluster_size=5;cluster_key=E34E1C215546;cluster_integrity=true;cluster_is_member=true;cluster_duplicate_nodes=null;cluster_clock_skew_stop_writes_sec=0;cluster_clock_skew_ms=0;cluster_clock_skew_outliers=null;uptime=1169939;system_free_mem_pct=85;system_swapping=false;heap_allocated_kbytes=2524046;heap_active_kbytes=2536544;heap_mapped_kbytes=2979840;heap_efficiency_pct=85;heap_site_count=35;objects=27073725;tombstones=0;tsvc_queue=0;info_queue=0;delete_queue=0;rw_in_progress=0;proxy_in_progress=0;tree_gc_queue=0;client_connections=21;heartbeat_connections=4;fabric_connections=96;heartbeat_received_self=0;heartbeat_received_foreign=18707838;reaped_fds=7;info_complete=19022348;demarshal_error=0;early_tsvc_client_error=0;early_tsvc_batch_sub_error=0;early_tsvc_udf_sub_error=0;batch_index_initiate=33091763;batch_index_queue=0:0,0:0;batch_index_complete=33085559;batch_index_error=0;batch_index_timeout=6204;batch_index_delay=0;batch_index_unused_buffers=45;batch_index_huge_buffers=0;batch_index_created_buffers=45;batch_index_destroyed_buffers=0;batch_initiate=0;batch_queue=0;batch_error=0;batch_timeout=0;scans_active=0;query_short_running=0;query_long_running=0;sindex_ucgarbage_found=0;sindex_gc_retries=0;sindex_gc_list_creation_time=0;sindex_gc_list_deletion_time=0;sindex_gc_objects_validated=0;sindex_gc_garbage_found=0;sindex_gc_garbage_cleaned=0;paxos_principal=BB90DA2E694520B;migrate_allowed=true;migrate_partitions_remaining=0;fabric_bulk_send_rate=0;fabric_bulk_recv_rate=0;fabric_ctrl_send_rate=0;fabric_ctrl_recv_rate=0;fabric_meta_send_rate=0;fabric_meta_recv_rate=0;fabric_rw_send_rate=519;fabric_rw_recv_rate=289
3 :  features
     peers;cdt-list;cdt-map;cluster-stable;pipelining;geo;float;batch-index;replicas;replicas-all;replicas-master;replicas-prole;udf
4 :  partition-generation
     2031
5 :  build_time
     Fri Aug 10 00:22:11 UTC 2018
6 :  edition
     Aerospike Community Edition
7 :  version
     Aerospike Community Edition build 4.2.0.10
8 :  build
     4.2.0.10
9 :  services
     10.[REDACTED]:3000;10.[REDACTED]:3000;10.[REDACTED]:3000;10.[REDACTED]:3000
10 :  services-alumni
     ***
11 :  build_os
     el6

aerospike.conf:

# Aerospike database configuration file for deployments using mesh heartbeats.
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    proto-fd-max 15000
}
logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 3002 # Heartbeat port for this node.
        # List one or more other nodes, one ip-address & port per line:
        mesh-seed-address-port 10.[REDACTED] 3002
        mesh-seed-address-port 10.[REDACTED] 3002
        mesh-seed-address-port 10.[REDACTED] 3002
        mesh-seed-address-port 10.[REDACTED] 3002
        ## generated mesh configuration ends
        interval 250
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}
namespace some_namespace {
    replication-factor 3
    memory-size 8G
    default-ttl 0000
    storage-engine device {
        device /dev/nvme0n1p1 /dev/xvdh
    }
    set myset1 {
        set-disable-eviction true
    }
    set myset2 {
        set-disable-eviction true
    }
}

@xqzhang2015

Hi @khaf
One follow-up on your comment:

By itself, this should not be a problem, unless it happens over a continuous period of time.

We have encountered partition map issues (partitions with a nil node) at runtime more than once, which leads to Get/BatchGet failures.

Client.IsConnected() does not reflect whether any partition has a nil node; correct me if my understanding is wrong. Is it possible to add a flag to indicate health status?

Also, setPartitions() triggers partition map validation when updatePartitionMap is true and logs at Error level if there is any problem. But the later clstr.getPartitions().validate() call logs only at Debug level when err is not nil. As a result, the host process may fail to notice a partition map error, for example when a node is down or not actually connected (not sure this is the right example). Could we change it to Error level too?

	if updatePartitionMap {
		clstr.setPartitions(partitionMap)
	}

	if err := clstr.getPartitions().validate(); err != nil {
		Logger.Debug("Error validating the cluster partition map after tend: %s", err.Error())
	}

BTW, I have a technical question: during the tending in waitTillStabilized() at Aerospike client initialization, most of the seed nodes are removed and then re-added on the next tend round. What is the purpose of that? Is there a design wiki/doc about the tend strategy?

Thanks in advance.

@gmaz42
Author

gmaz42 commented Jan 7, 2021

We have encountered partition map issues (partitions with a nil node) at runtime more than once, which leads to Get/BatchGet failures.

Isn't this a different issue than the one reported here?

@xqzhang2015

@gmazzotta
During startup, this error message is normal, because the Aerospike client needs multiple tend rounds to gather cluster info such as the nodes and the partition map.
But in my case it is logged at runtime, which leads to errors when accessing business data.

@gmaz42
Author

gmaz42 commented Jan 7, 2021

Is there a possibility here that the client is operating inconsistently for an initial time window? If yes, should the client withhold operations until such multiple tend rounds are complete?

@khaf
Collaborator

khaf commented Jan 7, 2021

@xqzhang2015 Thanks for the report, I'll take care of the log change before the weekend. For the other issue during the first tend, I'll have to look.
@gmazzotta While I haven't been updating the issue here, I've been working actively to sort out your reported issue. Will update as soon as I have the fix.

@gmaz42
Author

gmaz42 commented Jan 7, 2021

Ah, no problem, and glad you have enough information to reproduce the issue @khaf! Many thanks for your work.

I forgot to mention some details:

  • client policy: default policy with a single change: ReplicaPolicy = MASTER_PROLES
  • connection to server: direct, but does not use DNS: the individual IP addresses of each node are provided to the client
	c, err := aero.NewClientWithPolicyAndHost(aero.NewClientPolicy(), nodeIPs...)
        [...]

	// set default policy for all reads without a specific policy
	c.DefaultPolicy = aero.NewPolicy()
	// allow reading from replica nodes instead of master only
	c.DefaultPolicy.ReplicaPolicy = aero.MASTER_PROLES
	// set default policy for all batch reads/writes without a specific policy
	c.DefaultBatchPolicy = aero.NewBatchPolicy()
	c.DefaultBatchPolicy.ReplicaPolicy = aero.MASTER_PROLES

@gmaz42
Author

gmaz42 commented May 5, 2021

By the way, this is still happening (I see it when a service starts up):

"Partition map error: Master partition nodes not defined for namespace `some_namespace`: 1343 out of 4096\nReplica partition nodes not defined for namespace `my-namespace`: 1367 out of 4096\nPartition map errors normally occur when the cluster has partitioned due to network anomaly or node crash, or is not configured properly. Refer to https://www.aerospike.com/docs/operations/configure for more information.\n.

@khaf
Collaborator

khaf commented Aug 9, 2021

This issue should have been mitigated significantly by the last release. Is it still happening?

@gmaz42
Author

gmaz42 commented Aug 16, 2021

@khaf I am going to try and report back; question about the release: I get this error with go build:

go: errors parsing go.mod:
test-project/go.mod:7:2: require github.com/aerospike/aerospike-client-go: version "v5.3.0" invalid: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v5

Would you consider merging a PR that adds go.mod and go.sum to the project?
Also, the tag v5.3.0 does not match the version in the CHANGELOG (v4.5.2).

@khaf
Collaborator

khaf commented Aug 16, 2021

The project does have a go.mod but it is on the v5 branch. The master has been left on the old v4 branch for maintenance purposes only. You should import it via:

import "github.com/aerospike/aerospike-client-go/v5"

@gmaz42
Author

gmaz42 commented Aug 16, 2021

The project does have a go.mod but it is on the v5 branch. The master has been left on the old v4 branch for maintenance purposes only. You should import it via:

import "github.com/aerospike/aerospike-client-go/v5"

Should this be mentioned in the README.md of both the master and v5 branches?

@khaf
Collaborator

khaf commented Aug 16, 2021

It is, right at the top.

@gmaz42
Author

gmaz42 commented Aug 16, 2021

@khaf I have not yet upgraded to the latest v5, but I can confirm that with v4.5.2 it is still there:

Error validating the cluster partition map after tend: Master partition nodes not defined for namespace `some_namespace`: 2767 out of 4096\nReplica partition nodes not defined for namespace `some_namespace`: 2699\nPartition map errors normally occur when the cluster has partitioned due to network anomaly or node crash, or is not configured properly. Refer to https://www.aerospike.com/docs/operations/configure for more information.\n

(was using v3.1.0 before)

Will report back about v5.3.0 in a couple days.

@gmaz42
Author

gmaz42 commented Aug 17, 2021

I will not be able to provide any results here, because I just discovered that using the v5 client requires upgrading the Aerospike server to at least v4.9. Any chance the corresponding logic (de5f719) will be backported to master (v4)?
