
deploy HCM timed out

Hello,

I launched a deployment of HCM 2020.05 on a CentOS 7.7 server that I will name the "master server".

My installation is a non-production environment that will be used as a training lab and is composed of:

- one master server (a unique master)

- one worker server

- one Vertica server

- one PostgreSQL server (external DB)

At the HCM deployment step (the documentation refers to this page: https://docs.microfocus.com/doc/Hybrid_Cloud_Management/2020.05/DeploySuite), I got a timeout error for some components, as in the capture below:

Indeed, some pods were not running after deployment : 

[root@hcm-master-1 ~]# kubectl get pods -n hcm-n0waq
NAME READY STATUS RESTARTS AGE
broker-0 2/2 Running 0 16h
hcm-accounts-6b4b4998b8-jmqkq 0/2 Init:2/3 0 16h
hcm-ara-85dd9bf988-tpdcq 1/2 ErrImagePull 0 16h
hcm-autopass-5cfcb69875-9dd8z 2/2 Running 1 16h
hcm-cloudsearch-758fb49fdf-kj9g2 2/2 Running 0 16h
hcm-co-optimizer-7567549969-t6f9p 1/2 Running 36 14h
hcm-composer-59d5f5c4dd-sxm7s 0/2 Init:1/2 0 14h
hcm-composer-gateway-57b464f5c8-vtv94 0/2 Init:1/2 0 16h
hcm-content-fnsqk 0/1 Init:2/4 0 16h
hcm-content-tools-gqwxh 0/1 Completed 0 16h
hcm-coso-config-data-xrjk6 0/1 Completed 0 16h
hcm-costpolicy-b6f97c65-vw247 0/2 Init:3/6 0 16h
hcm-csa-75487f85bf-dxq5f 1/2 ErrImagePull 1 16h
hcm-csa-collector-6965565f4d-z8bs5 0/2 Init:1/3 0 16h
hcm-elasticsearch-75f45bc7db-642sz 2/2 Running 0 16h
hcm-idm-config-data-pg5kq 0/1 Completed 0 16h
hcm-image-catalog-7b544bb886-nqcvf 2/2 Running 0 14h
hcm-integration-gateway-c9d5f9c86-5vt65 0/2 Init:1/2 0 16h
hcm-itom-di-dp-master-dpl-6565b44b-5xvcs 2/2 Running 2 16h
hcm-mpp-5467f84487-gr2p8 0/2 Init:2/3 0 16h
hcm-nginx-ingress-controller-94fxz 1/1 Running 0 16h
hcm-oo-585bffff9f-6qp4g 2/2 Running 0 16h
hcm-oodesigner-7cbfd8d9df-9kwql 1/2 ErrImagePull 1 16h
hcm-policy-gateway-6ff9cb8f45-dgdg9 0/2 Init:1/2 0 16h
hcm-scheduler-6cdcb46688-q8fgr 2/2 Running 1 16h
hcm-showback-7d5b76d688-nmmkl 0/2 Init:4/5 0 16h
hcm-showback-gateway-7b44f558c4-qnxzb 0/2 Init:1/2 0 16h
hcm-ucmdb-544cbf8cd8-lkblb 1/2 ErrImagePull 1 16h
hcm-ucmdb-browser-56fbbdb89d-fmvng 0/2 Init:1/2 0 16h
hcm-ucmdb-probe-7795b5bfdd-t6fvb 0/2 Pending 0 16h
itom-di-administration-77c4dfc79-46zzv 2/2 Running 1 16h
itom-di-dp-job-submitter-dpl-54d54f9798-cmq2q 2/2 Running 0 16h
itom-di-dp-worker-dpl-c67997845-2zm9j 2/2 Running 2 16h
itom-di-receiver-dpl-db984c989-wtj9q 2/2 Running 0 16h
itom-di-vertica-ingestion-d547dbf86-xpn7r 1/2 Running 100 16h
itom-di-zk-dpl-0 1/1 Running 0 16h

I tried to restart kube on the master and the worker, but that did not change the pod statuses. I rebooted the master server because a process was stuck.
After that, some pods were running as expected, like csa, but others still were not, like mpp:

[root@hcm-master-1 ~]# kubectl get pods -n hcm-n0waq
NAME READY STATUS RESTARTS AGE
broker-0 2/2 Running 0 17h
hcm-accounts-6b4b4998b8-nkfvm 2/2 Running 4 17h
hcm-ara-85dd9bf988-j25f6 2/2 Running 0 17h
hcm-autopass-5cfcb69875-ghcfp 2/2 Running 0 18h
hcm-cloudsearch-758fb49fdf-hjxjh 2/2 Running 0 17h
hcm-co-optimizer-7567549969-4s2k5 1/2 ErrImagePull 0 18h
hcm-composer-59d5f5c4dd-8xdsr 2/2 Running 1 17h
hcm-composer-gateway-57b464f5c8-dtlwl 2/2 Running 20 17h
hcm-content-tools-gqwxh 0/1 Completed 0 4d16h
hcm-costpolicy-b6f97c65-qhmw4 1/2 Running 73 17h
hcm-csa-75487f85bf-2wmlg 2/2 Running 0 18h
hcm-csa-collector-6965565f4d-pnqm8 2/2 Running 5 17h
hcm-elasticsearch-75f45bc7db-nqkfc 2/2 Running 0 17h
hcm-image-catalog-7b544bb886-97mqv 2/2 Running 7 17h
hcm-integration-gateway-c9d5f9c86-j75x6 2/2 Running 2 17h
hcm-itom-di-dp-master-dpl-6565b44b-chnkm 2/2 Running 0 17h
hcm-mpp-5467f84487-gr2p8 1/2 Running 123 4d16h
hcm-nginx-ingress-controller-94fxz 1/1 Running 2 4d17h
hcm-oo-585bffff9f-6qp4g 2/2 Running 5 4d16h
hcm-oodesigner-7cbfd8d9df-s4l2j 2/2 Running 0 17h
hcm-policy-gateway-6ff9cb8f45-kbdpg 0/2 Init:1/2 0 17h
hcm-scheduler-6cdcb46688-9cs6h 2/2 Running 0 17h
hcm-showback-7d5b76d688-nmmkl 2/2 Running 1 4d16h
hcm-showback-gateway-7b44f558c4-z2tfx 2/2 Running 6 17h
hcm-ucmdb-544cbf8cd8-nxhcp 2/2 Running 0 17h
hcm-ucmdb-browser-56fbbdb89d-t2ql4 2/2 Running 3 17h
hcm-ucmdb-probe-7795b5bfdd-vzlnm 0/2 Pending 0 17h
itom-di-administration-77c4dfc79-sfb8n 2/2 Running 2 17h
itom-di-dp-job-submitter-dpl-54d54f9798-bffzs 2/2 Running 0 17h
itom-di-dp-worker-dpl-c67997845-hvcl7 2/2 Running 2 17h
itom-di-receiver-dpl-db984c989-sfpx8 2/2 Running 0 17h
itom-di-vertica-ingestion-d547dbf86-ggwfp 1/2 Running 7 17h
itom-di-zk-dpl-0 1/1 Running 0 17h 

[root@hcm-master-1 ~]# /opt/kubernetes/bin/kube-status.sh

Server certificate expiration date: Sep 10 13:01:18 2024 GMT, 357 days left

Get Node IP addresses ...

Master servers: hcm-master-1.xxx.fr
Worker servers: hcm-master-1.xxx.fr hcm-worker-1.xxx.fr

Checking status on 10.X.Y.40
--------------------------------------
Local services status:
[DockerVersion] Docker:v19.03.5 ...................................... Running
[DockerStorageFree] docker ........................................... 11.191 GB
[KubeVersion] Client:v1.15.5 Server:v1.15.5 ......................... Running
[NativeService] docker ............................................... Running
[NativeService] kubelet .............................................. Running
[NativeService] kube-proxy ........................................... Running

Cluster services status:
[APIServer] API Server - hcm-master-1.xxx.fr:8443 ........ Running
[MngPortal] URL: hcm-master-1.xxx.fr:5443 ................ Running
[Node]
(Master) hcm-master-1.xxx.fr ................................ Running
(Worker) hcm-master-1.xxx.fr ................................ Running
(Worker) hcm-worker-1.xxx.fr ................................ Running
[Pod]
<hcm-master-1.xxx.fr>
(kube-system) apiserver-hcm-master-1.xxx.fr ................. Running
(kube-system) controller-hcm-master-1.xxx.fr ................ Running
(kube-system) scheduler-hcm-master-1.xxx.fr ................. Running
(kube-system) etcd-hcm-master-1.xxx.fr ...................... Running
<hcm-worker-1.xxx.fr>
[DaemonSet]
(kube-system) coredns ........................................... 1/1
(kube-system) kube-flannel-ds-amd64 ............................. 2/2
(core) kube-registry ............................................ 1/1
(core) fluentd .................................................. 2/2
(core) itom-logrotate ........................................... 2/2
[Deployment]
(kube-system) heapster-apiserver ................................ 1/1
(core) metrics-server ........................................... 1/1
(core) itom-cdf-tiller .......................................... 1/1
(core) itom-logrotate-deployment ................................ 1/1
(core) idm ...................................................... 2/2
(core) mng-portal ............................................... 1/1
(core) cdf-apiserver ............................................ 1/1
(core) suite-installer-frontend ................................. 1/1
(core) nginx-ingress-controller ................................. 2/2
(core) itom-cdf-ingress-frontend ................................ 2/2
[Service]
(default) kubernetes ............................................ Running
(kube-system) heapster .......................................... Running
(core) idm-svc .................................................. Running
(core) kube-dns ................................................. Running
(core) kube-registry ............................................ Running
(core) kubernetes-vault ......................................... Running
(core) mng-portal ............................................... Running
(core) suite-installer-svc ...................................... Running
(core) cdf-svc .................................................. Running
(core) cdf-suitefrontend-svc .................................... Running
(core) nginx-ingress-controller-svc ............................. Running
(core) itom-cdf-ingress-frontend-svc ............................ Running
(core) metrics-server ........................................... Running
[NFS]
<PersistentVolume: hcm-n0waq-db-backup-vol>
hcm-master-1.xxx.fr:/var/vols/itom/db-backup ................ Passed
<PersistentVolume: hcm-n0waq-hcm-vol-claim>
hcm-master-1.xxx.fr:/var/vols/itom/hcm ...................... Passed
<PersistentVolume: itom-logging>
hcm-master-1.xxx.fr:/var/vols/itom/logs ..................... Passed
<PersistentVolume: itom-vol>
10.X.Y.40:/var/vols/itom/core ............................... Passed
[DB] cdfidmdb ........................................................ Passed

Full CDF is Running

My question is how to know where to start looking for the components that failed first. I know the command to investigate the errors of a pod that is not running, for example:

# kubectl logs cdf-apiserver-7c4fbc8b7c-92fxg -n core -c cdf-apiserver

But I would like to know if there is a log file that describes a chronology of the events, like beginning to deploy the csa pod, then the mpp pod, then the oo pod, etc. Does this file exist somewhere? (I want to know the first pod that went into error during deployment so I can fix that error first, as there may be dependencies between pod installations.)

Also, I wanted to uninstall HCM in order to install it again from the management portal, but the login returns a "username/password" error, even though the credentials have not changed since the last time I tried to deploy HCM.
Is it possible to change the credentials of the management portal so that I can uninstall properly?

Thanks in advance for your help !

Regards,

Jean-Philippe


  • 0

    Hello Jean-Philippe,

    Thank you for this information and for opening a new discussion.

    I'm sorry that things are still not working after addressing the other issues.

    Let me help answer some of your questions so that you can move forward.

    There are instructions on how to reset the HCM Portal Management password.  Those instructions can be found at the following URL:

    Reset CDF management portal password
    https://docs.microfocus.com/doc/Hybrid_Cloud_Management/2020.05/ResetCDFloginpassword

    I would suggest using the second option of resetting from within the IdM Pod.

    Once you reset the password, you should be able to remove everything and attempt the installation again.

    There may be a log file within the /opt/kubernetes/log directory that shows the installation process.  However, it will not be very helpful, since many of the pods are not dependent on each other.
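    Another way to get a chronology is to look at the Kubernetes events themselves, which can be listed per namespace and sorted by time.  For example (a sketch, using the suite namespace from your output; note that events are only retained for a limited time, about an hour by default, so this mostly helps with recent failures):

    # kubectl get events -n hcm-n0waq --sort-by=.metadata.creationTimestamp
    # kubectl get events -n core --sort-by=.metadata.creationTimestamp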

    Let me help provide you with some information and some commands that you can use to help you isolate the issue.

    Use the following command to return all of the pods that are not running normally.

    # kubectl get pods --all-namespaces -o wide | awk -F " *|/" '($3!=$4 || $5!="Running") && $5!="Completed" {print $0}'

    This should return only the pods within all of the namespaces that are having problems.  This means that it will be a shorter list that you can focus on, etc.

    As you look at this list, if you see pods that are in either an 'init' or 'pending' state, you can ignore those for now.  This normally means that they are waiting for some other pod or function to be fully running before they come up fully.  They are dependent on another pod that is having problems.

    Looking at the pods that are in a 'Running' or 'ErrImagePull' state but not fully up would be a good starting point.

    First, I know that Cloud Optimizer is dependent on Vertica, so if the Vertica or associated pods are not running properly, then you would start with the Vertica pods.  These are the pods that start with 'itom-di-...' within the list.
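    For example, to list just those pods, and to check basic network connectivity from the master to the external Vertica host (a sketch; <vertica_host> is a placeholder and 5433 is the default Vertica client port, adjust if yours differs):

    # kubectl get pods -n hcm-n0waq -o wide | grep itom-di-
    # nc -vz <vertica_host> 5433 -w 5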

    Here are two commands (one of which you already know) to look at information for the pods.

    # kubectl describe pod <pod_name> -n <namespace> 

    # kubectl logs <pod_name> -n <namespace> -c <container>

    The first one will provide the internal configuration of the pod and what it is supposed to run.  It has an 'init' section that provides the list of containers that need to be running prior to it starting up.  There is also a 'container' section that is the containers that are part of the running environment.  When you see running 1/2, that means that it is working on bringing up the 2nd container within the container list.
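    If you just want the container names without reading the whole describe output, something like this should also work (a sketch using standard kubectl jsonpath; the first line lists the init containers, the second the regular containers):

    # kubectl get pod <pod_name> -n <namespace> -o jsonpath='{.spec.initContainers[*].name}{"\n"}{.spec.containers[*].name}{"\n"}'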

    At the end of the describe output is the Events section, which will also provide some indication of the problem.
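    To jump straight to that section, you can filter the describe output, for example:

    # kubectl describe pod <pod_name> -n <namespace> | sed -n '/^Events:/,$p'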

    The second command you already know.  Just remember to include the -c <container> part.  If you omit it, some pods will return a list of the possible container names, while others will not and will simply not give you the output you expect.  Again, the <container> name comes from the list within the describe output.
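    If you want to dump the recent logs of every container of a pod in one go, a small loop like the following can help (a sketch; replace the two placeholders first):

    # POD=<pod_name>; NS=<namespace>
    # for c in $(kubectl get pod $POD -n $NS -o jsonpath='{.spec.containers[*].name}'); do echo "--- $c ---"; kubectl logs $POD -n $NS -c $c --tail=50; done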

    I hope that this information helps.

    Good luck.

    Regards,

    Mark

  • 0 in reply to 

    Hello Mark,

    Thank you for the time you spent helping me with this issue.

    First, I followed the second option on https://docs.microfocus.com/doc/Hybrid_Cloud_Management/2020.05/ResetCDFloginpassword

    The output of the commands confirms that the password for the CDF management portal has been reset:

    #kubectl exec -it $(kubectl get pod -n core -ocustom-columns=NAME:.metadata.name |grep idm|head -1) -n core -c idm sh

    # sh /idmtools/idm-installer-tools/idm.sh databaseUser resetPassword -org Provider -name "admin" -plainPwd XXXXX
    INFO Password for admin is reset
    # sh /idmtools/idm-installer-tools/idm.sh databaseUser unlockUser -org Provider -name admin
    INFO User admin is unlocked successfully.

    So I tried to connect to the portal with this temporary password, but I keep getting the same HTTP/403 error (Authentication Failure):

    As I was unable to access the CDF management portal, I continued investigating by looking at the pods that are having problems.

    I no longer have any pods in the "ErrImagePull" state, but some are running without being fully up, as you explained (0/2 or 1/2).

    The ucmdb probe pod has the status "ImagePullBackOff":

    hcm-n0waq     hcm-ucmdb-probe-7795b5bfdd-m6cm9                1/2     ImagePullBackOff   0          22h     172.16.34.143   hcm-worker-1.xxx.fr   <none>           <none>

    [root@hcm-master-1 ~]# kubectl describe pod hcm-ucmdb-probe-7795b5bfdd-m6cm9 -n hcm-n0waq
    Name: hcm-ucmdb-probe-7795b5bfdd-m6cm9
    Namespace: hcm-n0waq
    Priority: -1006000
    Priority Class Name: cdf-lowest-priority
    Node: hcm-worker-1.xxx.fr/10.X.Y.41
    Start Time: Tue, 19 Sep 2023 16:51:19 +0200
    Labels: app=hcm-ucmdb-probe-app
    pod-template-hash=7795b5bfdd
    Annotations: pod.boostport.com/vault-approle: hcm-n0waq-core
    pod.boostport.com/vault-init-container: install
    Status: Pending
    IP: 172.16.34.143
    Controlled By: ReplicaSet/hcm-ucmdb-probe-7795b5bfdd

    (...)

    At the end of the describe output, an Events section mentions the ImagePullBackOff error:

    (...)

    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Warning Failed 15m (x175 over 21h) kubelet, hcm-worker-1.xxx.fr Failed to pull image "localhost:5000/hpeswitom/itom-cmdb-probe:2020.02.118": rpc error: code = Unknown desc = context canceled
    Warning Failed 5m50s (x3868 over 21h) kubelet, hcm-worker-1.xxx.fr Error: ImagePullBackOff
    Normal BackOff 41s (x3881 over 21h) kubelet, hcm-worker-1.xxx.fr Back-off pulling image "localhost:5000/hpeswitom/itom-cmdb-probe:2020.02.118"
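    One way to check whether this image can actually be pulled from the local registry on the worker would be, for example, to run the following on hcm-worker-1 (a sketch, assuming the embedded registry exposes the standard Docker Registry v2 API on localhost:5000; adjust http/https to how the registry is configured):

    # curl -sk https://localhost:5000/v2/hpeswitom/itom-cmdb-probe/tags/list
    # docker pull localhost:5000/hpeswitom/itom-cmdb-probe:2020.02.118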

    If I take a look at the mpp pod:

    hcm-n0waq     hcm-mpp-5467f84487-n9r6c                        1/2     Running            1

    [root@hcm-master-1 ~]# kubectl describe pod hcm-mpp-5467f84487-n9r6c -n hcm-n0waq
    Name: hcm-mpp-5467f84487-n9r6c
    Namespace: hcm-n0waq
    Priority: -1000000
    Priority Class Name: cdf-highest-priority
    Node: hcm-master-1.xxx.fr/10.X.Y.40
    Start Time: Tue, 19 Sep 2023 16:40:10 +0200
    Labels: app=hcm-mpp-app
    pod-template-hash=5467f84487
    Annotations: pod.boostport.com/vault-approle: hcm-n0waq-core
    pod.boostport.com/vault-init-container: install
    Status: Running
    IP: 172.16.49.56
    Controlled By: ReplicaSet/hcm-mpp-5467f84487
    Init Containers:
    install:
    Container ID: docker://1818b1bfd428edf9100e81ea7e5633354d193b2632178ecb1420eb30c5a50c04
    Image: localhost:5000/hpeswitom/kubernetes-vault-init:0.8.0-006
    Image ID: docker-pullable://localhost:5000/hpeswitom/kubernetes-vault-init@sha256:cc67ff15f8f46e8cf6243e1386bb3a93a15270d5d894ffb19b4836ffb03e8f41
    Port: <none>
    Host Port: <none>
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Wed, 20 Sep 2023 11:42:15 +0200
    Finished: Wed, 20 Sep 2023 11:42:43 +0200
    Ready: True
    Restart Count: 4
    Environment:
    VAULT_ROLE_ID: 4fb0aa94-82a5-a911-a7bd-2b2e7d510ef5
    CERT_COMMON_NAME: hcm-master-1.xxx.fr
    Mounts:
    /var/run/secrets/boostport.com from vault-token (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-jzspg (ro)
    waitfor-idm:
    Container ID: docker://17b582e2bafe9c3f340c661824c000bd180db9ce04e0277eeb48686d5ec5c719
    Image: localhost:5000/hpeswitom/itom-busybox:1.30.0-003
    Image ID: docker-pullable://localhost:5000/hpeswitom/itom-busybox@sha256:c8a39f236b193a174acc74280a36a2b3faca7cb3ea6bc708b63d8fd628950784
    Port: <none>
    Host Port: <none>
    Command:
    sh
    -c
    until nc -vz idm-svc.core 443 -w 5; do echo waiting for idm-svc.core; done;
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Wed, 20 Sep 2023 11:42:50 +0200
    Finished: Wed, 20 Sep 2023 11:50:22 +0200
    Ready: True
    Restart Count: 0
    Environment: <none>
    Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-jzspg (ro)
    waitfor-csa:
    Container ID: docker://617f83e9d6aaaa07ab0dfcba6a410fcc07a01efbf550bda6349158f5170f5748
    Image: localhost:5000/hpeswitom/itom-busybox:1.30.0-003
    Image ID: docker-pullable://localhost:5000/hpeswitom/itom-busybox@sha256:c8a39f236b193a174acc74280a36a2b3faca7cb3ea6bc708b63d8fd628950784
    Port: <none>
    Host Port: <none>
    Command:
    sh
    -c
    until nc -vz ${HCM_CSA_SVC_SERVICE_HOST} ${HCM_CSA_SVC_SERVICE_PORT} -w 5; do echo waiting for hcm-csa-svc; done;
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Wed, 20 Sep 2023 11:50:50 +0200
    Finished: Wed, 20 Sep 2023 13:00:57 +0200
    Ready: True
    Restart Count: 0
    Environment: <none>
    Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-jzspg (ro)

    And this is the log I get from the mpp pod:
    # kubectl logs hcm-mpp-5467f84487-n9r6c -n hcm-n0waq -c hcm-mpp
    Group 'mail' not found. Creating the user mailbox file with 0600 mode.

    # kubectl logs hcm-mpp-5467f84487-n9r6c -n hcm-n0waq -c install

    time="2023-09-20T09:42:15Z" level=info msg="pod IP:172.16.49.56"
    time="2023-09-20T09:42:15Z" level=warning msg="Failed to get pod approle, try again after 5 seconds."
    time="2023-09-20T09:42:20Z" level=warning msg="Failed to get pod approle, try again after 5 seconds."
    time="2023-09-20T09:42:25Z" level=warning msg="Failed to get pod approle, try again after 5 seconds."
    time="2023-09-20T09:42:30Z" level=warning msg="Failed to get pod approle, try again after 5 seconds."
    time="2023-09-20T09:42:40Z" level=info msg="role id:4fb0aa94-82a5-a911-a7bd-2b2e7d510ef5"
    time="2023-09-20T09:42:43Z" level=info msg="Successfully generate certificates. Exiting."
    time="2023-09-20T09:42:43Z" level=info msg="Successfully generate trustedCAs. Exiting."
    time="2023-09-20T09:42:43Z" level=info msg="Successfully generate issue_ca. Exiting."

    # kubectl logs hcm-mpp-5467f84487-n9r6c -n hcm-n0waq -c waitfor-idm
    waiting for idm-svc.core
    waiting for idm-svc.core

    (...)

    [root@hcm-master-1 ~]# kubectl logs hcm-mpp-5467f84487-n9r6c -n hcm-n0waq -c waitfor-csa |more
    waiting for hcm-csa-svc
    waiting for hcm-csa-svc

    (...)

    Not very interesting...

    Regards,

    Jean-Philippe

  • 0 in reply to 

    Update...
    I've been able to reset the admin account for this URL: https://hcm-master-1.xxx.fr:5443

    But it is still not possible to log in to https://hcm-master-1.xxx.fr:3000

  • 0 in reply to 

    Hello Jean-Philippe,

    I apologize for not noticing the difference earlier.

    You can only access port 3000 before the HCM installation.  Afterwards, it will not be accessible since the suite is installed.

    First, confirm that all of the pods within the 'core' namespace are running properly.
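    For example, the same kind of awk filter can be restricted to the 'core' namespace (a sketch; apart from the header line, an empty result means everything there is healthy):

    # kubectl get pods -n core | awk -F " *|/" '($2!=$3 || $4!="Running") && $4!="Completed" {print $0}'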

    Once that is set, you can login to the CDF Admin page (https://hcm-master-1.xxx.fr:5443) and uninstall the application.

    You should see the dots to the right of the 'hcm' entry; one of the options there is 'uninstall'.

    This will uninstall items within the 'hcm' namespace but not the core.
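    Once the uninstall finishes, you can confirm that the suite namespace is empty before reinstalling, for example (it should eventually report that no resources are found):

    # kubectl get pods -n hcm-n0waq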

    Afterwards, you can install again and see if it works.

    Good luck.

    Regards,

    Mark