hello_world can not run #112

czq693497091 opened this issue Jan 27, 2022 · 7 comments

@czq693497091

I have installed k8s (v1.18.2) on a local cluster and used helm (v2.17.0) to install adaptdl and adaptdl-sched successfully:

root@k8s-master:/home/czq/Pollux/adaptdl_v2/examples/mnist# kubectl get pod -A | grep adaptdl
adaptdl adaptdl-registry-697884b65-wf4w6 1/1 Running 0 17h
adaptdl jazzed-koala-adaptdl-sched-85d75fdb5d-9lvzq 3/3 Running 6 17h
adaptdl jazzed-koala-validator-98f8fcf7c-jj959 1/1 Running 0 17h
adaptdl peeking-ostrich-adaptdl-sched-667c78f9fb-fr2zj 3/3 Running 4 17h

I then wrote the hello_world project exactly as described in the introduction, with the following structure:
└── hello_world
    ├── adaptdljob.yaml
    ├── Dockerfile
    └── hello_world.py

I ran "adaptdl submit hello_world" and got the following output:

/usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Using AdaptDL insecure registry.
Sending build context to Docker daemon 4.096kB
Step 1/4 : FROM python:3.7-slim
---> d3c9ad326043
Step 2/4 : RUN python3 -m pip install -i https://pypi.tuna.tsinghua.edu.cn/simple adaptdl
---> Using cache
---> 05dae174d67e
Step 3/4 : COPY hello_world.py /root/hello_world.py
---> Using cache
---> 10d12170490d
Step 4/4 : ENV PYTHONUNBUFFERED=true
---> Using cache
---> bc04efd29920
Successfully built bc04efd29920
Successfully tagged localhost:59283/adaptdl-submit:latest
Using default tag: latest
The push refers to repository [localhost:59283/adaptdl-submit]
2cab9519a560: Layer already exists
16f13637494a: Layer already exists
25ad0307b4c1: Layer already exists
874b45955cb1: Layer already exists
85c923303735: Layer already exists
d0fa20bfdce7: Layer already exists
2edcec3590a4: Layer already exists
latest: digest: sha256:7346ece45037f13481a30a50907418bbd460035f488a1aab3cfb0f8ebdf35644 size: 1790
W0126 21:25:38.652722 75926 helpers.go:535] --dry-run is deprecated and can be replaced with --dry-run=client.
Unsupported storageclass from available storageclasses []

I then ran "adaptdl ls", but it does not list the job:
root@k8s-master:/home/czq/Pollux/adaptdl_v2/examples/HelloWorld# adaptdl ls
/usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
No adaptdljobs
Name Status Start(UTC) Runtime Rplc Rtrt

How can I resolve this problem so that the job runs correctly?

@aurickq
Contributor

aurickq commented Apr 17, 2022

Unsupported storageclass from available storageclasses []

It looks like your K8s might not have any storageclasses installed. AdaptDL requires a shared filesystem which can be used to store checkpoints and other information when a job is restarted. Once you have a storageclass for a shared filesystem installed, you can pass it into the submit command with --checkpoint-storage-class=....
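
For anyone who wants to script this check, here is a minimal Python sketch. It assumes kubectl is on the PATH and configured for the cluster. The helper names (storage_class_names, build_submit_command, list_cluster_storage_classes) are made up for illustration and are not part of AdaptDL. The only AdaptDL-specific detail is the --checkpoint-storage-class flag mentioned above.

```python
import json
import subprocess
from typing import List


def storage_class_names(kubectl_json: str) -> List[str]:
    """Return the StorageClass names found in `kubectl get storageclass -o json` output."""
    items = json.loads(kubectl_json).get("items", [])
    return [item["metadata"]["name"] for item in items]


def build_submit_command(project_dir: str, storage_class: str) -> List[str]:
    """Build an `adaptdl submit` command line with an explicit checkpoint storage class."""
    return [
        "adaptdl",
        "submit",
        project_dir,
        f"--checkpoint-storage-class={storage_class}",
    ]


def list_cluster_storage_classes() -> List[str]:
    """Query the cluster for its StorageClasses (hypothetical helper, needs kubectl)."""
    result = subprocess.run(
        ["kubectl", "get", "storageclass", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return storage_class_names(result.stdout)
```

If list_cluster_storage_classes() returns an empty list, that matches the "available storageclasses []" part of the error, and a storageclass backed by a shared filesystem has to be installed before submitting. Otherwise, pass one of the returned names to build_submit_command("hello_world", name) and run the resulting command.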

@SHu0421

SHu0421 commented May 2, 2022

Hello @aurickq,
I also can't run hello_world, but I ran into a different problem:

The push refers to repository [docker.io/cindybrain/adaptdl-submit]
42247853ddec: Layer already exists 
7bb07b4b650b: Layer already exists 
6f3a145bdf9a: Layer already exists 
07752303aace: Layer already exists 
a598855c21e0: Layer already exists 
2c7950a1245f: Layer already exists 
9c1b6dd6c1e6: Layer already exists 
latest: digest: sha256:cc13db8e078711414917d022979ca54f73e48167d32f33a59cd3eb38830df392 size: 1790
W0502 17:37:55.091458  148014 helpers.go:535] --dry-run is deprecated and can be replaced with --dry-run=client.
Failure (InternalError): Internal error occurred: failed calling webhook "adaptdl-validator.adaptdl.svc.cluster.local": Post https://adaptdl-validator.adaptdl.svc:443/validate?timeout=10s: dial tcp 10.152.183.186:443: connect: connection refused

Because I am not familiar with ValidatingWebhookConfiguration, I don't know how to solve it.

@aurickq
Contributor

aurickq commented May 26, 2022

@SHu0421 This error could be caused by a variety of reasons. You can start by checking kubectl -n <adaptdl namespace> get all (replacing <adaptdl namespace> with the namespace in which you installed the adaptdl scheduler).
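
As a rough aid for reading that output, the sketch below flags pods whose containers are not ready. It is only one possible first check, not a full diagnosis. It assumes you feed it the JSON from kubectl -n <adaptdl namespace> get pods -o json, and the function name unready_pods is made up for illustration.

```python
import json
from typing import Dict, List


def unready_pods(pods_json: str) -> Dict[str, List[str]]:
    """Map pod name -> names of its containers that are not ready.

    Expects the output of `kubectl -n <adaptdl namespace> get pods -o json`.
    Pods whose containers are all ready are left out of the result.
    """
    problems: Dict[str, List[str]] = {}
    for pod in json.loads(pods_json).get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        not_ready = [c["name"] for c in statuses if not c.get("ready", False)]
        # A pod with no container statuses yet (for example, still Pending)
        # is reported too, with an empty container list.
        if not_ready or not statuses:
            problems[pod["metadata"]["name"]] = not_ready
    return problems
```

An empty result means every container reports ready. If the validator pod shows up in the result, the webhook service may have nothing healthy behind it, which would be consistent with a "connection refused" error, though other causes are possible.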

@gudiandian

@SHu0421 Hi, have you solved the problem?

@aurickq
Contributor

aurickq commented Jun 16, 2022

@gudiandian it sounds like it's related to the problem you are having in #124

@SHu0421

SHu0421 commented Jun 16, 2022

@gudiandian I switched from microk8s to a standard k8s instance (with three nodes), and I haven't hit the problem since. By the way, I used the insecure registry rather than an external registry.

@gudiandian

@SHu0421 Unfortunately, I am already using standard k8s. Thank you for your reply.
