
[Bug]: I can't Deploy a Milvus cluster in my k8s cluster #24607

Closed
1 task done
AdmondGuo opened this issue Jun 1, 2023 · 12 comments
Assignees
Labels
help wanted (Extra attention is needed), kind/bug (Issues or changes related to a bug), stale (indicates no updates for 30 days)

Comments

@AdmondGuo

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.2x
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: 4c16gx3
- GPU: 
- Others:

Current Behavior

I tried to install a Milvus cluster by following this page: https://milvus.io/docs/install_cluster-milvusoperator.md
I built a k8s cluster on Alibaba Cloud (aliyun). It contains 3 servers, each with 4 cores and 16 GB of memory.
When I started installing Milvus, I found that no storage class was available, so I set NFS as my default storage class.
Then I tried to start the deployment, but the status is still pending.

status:
  conditions:
  - lastTransitionTime: "2023-06-01T20:22:17Z"
    message: All etcd endpoints are unhealthy:[my-release-etcd.default:2379:context
      deadline exceeded]
    reason: EtcdNotReady
    status: "False"
    type: EtcdReady
  - lastTransitionTime: "2023-06-01T20:22:17Z"
    message: 'Get "http://my-release-minio.default:9000/minio/admin/v3/info": dial
      tcp 10.110.110.64:9000: connect: connection refused'
    reason: ClientError
    status: "False"
    type: StorageReady
  - lastTransitionTime: "2023-06-01T20:22:17Z"
    message: connection error
    reason: MsgStreamReady
    status: "False"
    type: MsgStreamReady
  - lastTransitionTime: "2023-06-01T20:22:18Z"
    message: 'dep[EtcdReady]: All etcd endpoints are unhealthy:[my-release-etcd.default:2379:context
      deadline exceeded];dep[StorageReady]: Get "http://my-release-minio.default:9000/minio/admin/v3/info":
      dial tcp 10.110.110.64:9000: connect: connection refused;dep[MsgStreamReady]:
      connection error;'
    reason: DependencyNotReady
    status: "False"
    type: MilvusReady
  - lastTransitionTime: "2023-06-01T20:22:17Z"
    message: Milvus components[rootcoord,datacoord,querycoord,indexcoord,datanode,querynode,indexnode,proxy]
      are updating
    reason: MilvusComponentsUpdating
    status: "False"
    type: MilvusUpdated
  endpoint: my-release-milvus.default:19530
  ingress:
    loadBalancer: {}
  observedGeneration: 1
  replicas: {}
  status: Pending

pods status:

NAME                                      READY   STATUS     RESTARTS   AGE
my-release-etcd-0                         0/1     Pending    0          27m
my-release-minio-58996444bf-9gmtg         0/1     Pending    0          27m
my-release-pulsar-bookie-0                0/1     Pending    0          39m
my-release-pulsar-bookie-1                0/1     Pending    0          27m
my-release-pulsar-broker-0                0/1     Init:0/2   0          26m
my-release-pulsar-proxy-0                 0/1     Init:0/2   0          26m
my-release-pulsar-zookeeper-0             0/1     Pending    0          39m
nfs-client-provisioner-7587f5bfdd-6kqj4   1/1     Running    0          57m
nginx-app-5c64488cdf-bjllq                1/1     Running    0          141m
nginx-app-5c64488cdf-c4pn5                1/1     Running    0          141m

What should I do to fix the error?

Expected Behavior

Deploy milvus correctly.

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@AdmondGuo AdmondGuo added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 1, 2023
@yanliang567
Contributor

/assign @locustbaby
/unassign

@yanliang567 yanliang567 added help wanted Extra attention is needed and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 2, 2023
@locustbaby
Contributor

Hi @AdmondGuo,
Could you please describe the pod and share the Events with us?
Do you have the storage class set to default?
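For reference, a minimal sketch of how to verify this (assuming kubectl access to the cluster; <storage-class-name> is a placeholder): the default storage class is the one marked (default) in kubectl get storageclass, and an existing class can be made the default with the standard annotation:

kubectl get storageclass
# mark an existing storage class as the cluster-wide default
kubectl patch storageclass <storage-class-name> \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'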

@AdmondGuo
Author

Hi @locustbaby,
There are three Alibaba Cloud bare-metal machines, and I built k8s on them.

root@host1:~# kubectl get nodes
NAME    STATUS   ROLES           AGE   VERSION
host1   Ready    control-plane   11h   v1.27.2
host2   Ready    <none>          9h    v1.27.2
host3   Ready    <none>          9h    v1.27.2

I tried to install Milvus on the k8s cluster, but I found I needed a StorageClass, so I built one backed by NFS and set it as the default.

root@host1:~# kubectl get sc
NAME                           PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
course-nfs-storage (default)   qgg-nfs-storage   Delete          Immediate           false                  8h

Finally I installed Milvus and got the pod status shown earlier.
How can I check the Milvus error messages? Where can I find the logs?

@locustbaby
Contributor

Sorry, my bad.
I see that most of your pods are in Pending status.
Could you please get their status with kubectl describe pod my-release-etcd-0?

@AdmondGuo
Author

@locustbaby
OK, here is the output.

root@host1:~# kubectl describe pod my-release-etcd-0
Name:             my-release-etcd-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app.kubernetes.io/instance=my-release-etcd
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=etcd
                  controller-revision-hash=my-release-etcd-6f7d4c456f
                  helm.sh/chart=etcd-6.3.3
                  statefulset.kubernetes.io/pod-name=my-release-etcd-0
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    StatefulSet/my-release-etcd
Containers:
  etcd:
    Image:       docker.io/milvusdb/etcd:3.5.5-r2
    Ports:       2379/TCP, 2380/TCP
    Host Ports:  0/TCP, 0/TCP
    Liveness:    exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=30s #success=1 #failure=5
    Readiness:   exec [/opt/bitnami/scripts/etcd/healthcheck.sh] delay=60s timeout=5s period=10s #success=1 #failure=5
    Environment:
      BITNAMI_DEBUG:                     false
      MY_POD_IP:                          (v1:status.podIP)
      MY_POD_NAME:                       my-release-etcd-0 (v1:metadata.name)
      ETCDCTL_API:                       3
      ETCD_ON_K8S:                       yes
      ETCD_START_FROM_SNAPSHOT:          no
      ETCD_DISASTER_RECOVERY:            no
      ETCD_NAME:                         $(MY_POD_NAME)
      ETCD_DATA_DIR:                     /bitnami/etcd/data
      ETCD_LOG_LEVEL:                    info
      ALLOW_NONE_AUTHENTICATION:         yes
      ETCD_ADVERTISE_CLIENT_URLS:        http://$(MY_POD_NAME).my-release-etcd-headless.default.svc.cluster.local:2379
      ETCD_LISTEN_CLIENT_URLS:           http://0.0.0.0:2379
      ETCD_INITIAL_ADVERTISE_PEER_URLS:  http://$(MY_POD_NAME).my-release-etcd-headless.default.svc.cluster.local:2380
      ETCD_LISTEN_PEER_URLS:             http://0.0.0.0:2380
      ETCD_AUTO_COMPACTION_MODE:         revision
      ETCD_AUTO_COMPACTION_RETENTION:    1000
      ETCD_QUOTA_BACKEND_BYTES:          4294967296
      ETCD_HEARTBEAT_INTERVAL:           500
      ETCD_ELECTION_TIMEOUT:             2500
    Mounts:
      /bitnami/etcd from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p92pn (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-my-release-etcd-0
    ReadOnly:   false
  kube-api-access-p92pn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  2m41s (x100 over 8h)  default-scheduler  0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..

@locustbaby
Contributor

It seems the PVC is not bound yet.
Could you please list the PVCs with kubectl get pvc?
If those PVCs are in Pending status, use kubectl describe pvc <pvc-name> to get the details.

@AdmondGuo
Author

Yes, the PVCs are pending.

root@host1:~# kubectl get pvc
NAME                                                             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS         AGE
data-my-release-etcd-0                                           Pending                                      course-nfs-storage   36h
data-my-release-etcd-1                                           Pending                                      course-nfs-storage   36h
data-my-release-etcd-2                                           Pending                                      course-nfs-storage   36h
export-my-release-minio-0                                        Pending                                      course-nfs-storage   36h
export-my-release-minio-1                                        Pending                                      course-nfs-storage   36h
export-my-release-minio-2                                        Pending                                      course-nfs-storage   36h
export-my-release-minio-3                                        Pending                                      course-nfs-storage   36h
my-release-minio                                                 Pending                                      course-nfs-storage   36h
my-release-pulsar-bookie-journal-my-release-pulsar-bookie-0      Pending                                      course-nfs-storage   36h
my-release-pulsar-bookie-journal-my-release-pulsar-bookie-1      Pending                                      course-nfs-storage   36h
my-release-pulsar-bookie-journal-my-release-pulsar-bookie-2      Pending                                      course-nfs-storage   36h
my-release-pulsar-bookie-ledgers-my-release-pulsar-bookie-0      Pending                                      course-nfs-storage   36h
my-release-pulsar-bookie-ledgers-my-release-pulsar-bookie-1      Pending                                      course-nfs-storage   36h
my-release-pulsar-bookie-ledgers-my-release-pulsar-bookie-2      Pending                                      course-nfs-storage   36h
my-release-pulsar-zookeeper-data-my-release-pulsar-zookeeper-0   Pending                                      course-nfs-storage   36h
root@host1:~# kubectl describe pvc data-my-release-etcd-0
Name:          data-my-release-etcd-0
Namespace:     default
StorageClass:  course-nfs-storage
Status:        Pending
Volume:        
Labels:        app.kubernetes.io/instance=my-release-etcd
               app.kubernetes.io/name=etcd
Annotations:   volume.beta.kubernetes.io/storage-provisioner: qgg-nfs-storage
               volume.kubernetes.io/storage-provisioner: qgg-nfs-storage
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Used By:       my-release-etcd-0
Events:
  Type    Reason                Age                   From                         Message
  ----    ------                ----                  ----                         -------
  Normal  ExternalProvisioning  60s (x8557 over 35h)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "qgg-nfs-storage" or manually created by system administrator

@AdmondGuo
Author

@locustbaby Sorry, but I still haven't fixed the problem. I don't know much about K8s. Could you please help me fix it?
Currently no PV has been created.

root@host1:~# kubectl get pv
No resources found

I think that's why the PVCs are all pending. This is my StorageClass config:

root@host1:~# cat sc-1.yaml 
# StorageClass for NFS-backed volumes
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-nfs-storage
provisioner: qgg-nfs-storage # must match the PROVISIONER_NAME env var in the provisioner deployment
parameters:
   archiveOnDelete: "false"

---
# NFS client provisioner deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-client-provisioner
  labels:
    app: nfs-client-provisioner
  namespace: default # must match the namespace in the RBAC file
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs-client-provisioner
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nfs-client-provisioner
    spec:
      serviceAccountName: nfs-client-provisioner # specify the serviceAccount
      containers:
        - name: nfs-client-provisioner
          image: hub.kaikeba.com/java12/nfs-client-provisioner:v1 # image address
          volumeMounts: # mount the NFS volume into the container
            - name: nfs-client-root
              mountPath: /persistentvolumes
          env:
            - name: PROVISIONER_NAME # name of the provisioner
              value: qgg-nfs-storage # must match the provisioner field in the StorageClass
            - name: NFS_SERVER # NFS server address
              value: #host eth0
            - name: NFS_PATH # NFS export path on the server
              value: /opt/k8s
      volumes: # declare the NFS volume
        - name: nfs-client-root
          nfs:
            server: #host eth0
            path: /opt/k8s

Maybe the provisioner image URL is wrong? How can I check it?
Could you please help me?
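For reference, a rough sketch of how this could be checked (assuming the Deployment is named nfs-client-provisioner, as in the config above): kubectl describe shows whether the image can be pulled, and the provisioner's own logs show whether it receives the provisioning requests and why they fail:

kubectl describe deployment nfs-client-provisioner
# logs of the provisioner pod shown in the earlier pod listing
kubectl logs -l app=nfs-client-provisioner
kubectl get events --sort-by=.metadata.creationTimestamp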

@locustbaby
Contributor

@AdmondGuo Hi, the error waiting for a volume to be created, either by external provisioner "qgg-nfs-storage" or manually created by system administrator is a known issue for NFS provisioning in k8s; you can refer to this and try to fix your NFS setup.
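For illustration only, and under the assumption (the linked issue is not reproduced here) that this is the well-known selfLink removal that breaks the legacy nfs-client-provisioner on Kubernetes 1.20 and later (the cluster above is v1.27.2): a commonly suggested fix is to deploy the maintained nfs-subdir-external-provisioner instead, for example via its Helm chart:

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=<nfs-server-ip> \
  --set nfs.path=/opt/k8s

By default the chart creates its own StorageClass (nfs-client), which would then need to be marked as the default or referenced explicitly before reinstalling the Milvus dependencies.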

@stale

stale bot commented Jul 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no updates for 30 days label Jul 9, 2023
@stale stale bot closed this as completed Jul 16, 2023
@Syed-Faizal-S

/reopen

facing the exact same issue

@sre-ci-robot
Contributor

@Syed-Faizal-S: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

facing the exact same issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
