
Deploy HCM timed out

Hello,

I launched a deployment of HCM 2020.05 on a CentOS 7.7 server that I will call the "master server".

My installation is a non-production environment that will be used as a training lab, and is composed of:

- one master server (a single master)

- one worker server

- one Vertica server

- one PostgreSQL server (external DB)

At the HCM deployment step (the doc refers to this page: https://docs.microfocus.com/doc/Hybrid_Cloud_Management/2020.05/DeploySuite), I got a timeout error for some components, as in the capture below:

Indeed, some pods were not running after deployment:

[root@hcm-master-1 ~]# kubectl get pods -n hcm-n0waq
NAME READY STATUS RESTARTS AGE
broker-0 2/2 Running 0 16h
hcm-accounts-6b4b4998b8-jmqkq 0/2 Init:2/3 0 16h
hcm-ara-85dd9bf988-tpdcq 1/2 ErrImagePull 0 16h
hcm-autopass-5cfcb69875-9dd8z 2/2 Running 1 16h
hcm-cloudsearch-758fb49fdf-kj9g2 2/2 Running 0 16h
hcm-co-optimizer-7567549969-t6f9p 1/2 Running 36 14h
hcm-composer-59d5f5c4dd-sxm7s 0/2 Init:1/2 0 14h
hcm-composer-gateway-57b464f5c8-vtv94 0/2 Init:1/2 0 16h
hcm-content-fnsqk 0/1 Init:2/4 0 16h
hcm-content-tools-gqwxh 0/1 Completed 0 16h
hcm-coso-config-data-xrjk6 0/1 Completed 0 16h
hcm-costpolicy-b6f97c65-vw247 0/2 Init:3/6 0 16h
hcm-csa-75487f85bf-dxq5f 1/2 ErrImagePull 1 16h
hcm-csa-collector-6965565f4d-z8bs5 0/2 Init:1/3 0 16h
hcm-elasticsearch-75f45bc7db-642sz 2/2 Running 0 16h
hcm-idm-config-data-pg5kq 0/1 Completed 0 16h
hcm-image-catalog-7b544bb886-nqcvf 2/2 Running 0 14h
hcm-integration-gateway-c9d5f9c86-5vt65 0/2 Init:1/2 0 16h
hcm-itom-di-dp-master-dpl-6565b44b-5xvcs 2/2 Running 2 16h
hcm-mpp-5467f84487-gr2p8 0/2 Init:2/3 0 16h
hcm-nginx-ingress-controller-94fxz 1/1 Running 0 16h
hcm-oo-585bffff9f-6qp4g 2/2 Running 0 16h
hcm-oodesigner-7cbfd8d9df-9kwql 1/2 ErrImagePull 1 16h
hcm-policy-gateway-6ff9cb8f45-dgdg9 0/2 Init:1/2 0 16h
hcm-scheduler-6cdcb46688-q8fgr 2/2 Running 1 16h
hcm-showback-7d5b76d688-nmmkl 0/2 Init:4/5 0 16h
hcm-showback-gateway-7b44f558c4-qnxzb 0/2 Init:1/2 0 16h
hcm-ucmdb-544cbf8cd8-lkblb 1/2 ErrImagePull 1 16h
hcm-ucmdb-browser-56fbbdb89d-fmvng 0/2 Init:1/2 0 16h
hcm-ucmdb-probe-7795b5bfdd-t6fvb 0/2 Pending 0 16h
itom-di-administration-77c4dfc79-46zzv 2/2 Running 1 16h
itom-di-dp-job-submitter-dpl-54d54f9798-cmq2q 2/2 Running 0 16h
itom-di-dp-worker-dpl-c67997845-2zm9j 2/2 Running 2 16h
itom-di-receiver-dpl-db984c989-wtj9q 2/2 Running 0 16h
itom-di-vertica-ingestion-d547dbf86-xpn7r 1/2 Running 100 16h
itom-di-zk-dpl-0 1/1 Running 0 16h

I tried restarting Kubernetes on the master and worker, but that did not change the pod statuses. I then rebooted the master server because a process was stuck.
After that, some pods were running as expected, like csa, but others still were not, like mpp:

[root@hcm-master-1 ~]# kubectl get pods -n hcm-n0waq
NAME READY STATUS RESTARTS AGE
broker-0 2/2 Running 0 17h
hcm-accounts-6b4b4998b8-nkfvm 2/2 Running 4 17h
hcm-ara-85dd9bf988-j25f6 2/2 Running 0 17h
hcm-autopass-5cfcb69875-ghcfp 2/2 Running 0 18h
hcm-cloudsearch-758fb49fdf-hjxjh 2/2 Running 0 17h
hcm-co-optimizer-7567549969-4s2k5 1/2 ErrImagePull 0 18h
hcm-composer-59d5f5c4dd-8xdsr 2/2 Running 1 17h
hcm-composer-gateway-57b464f5c8-dtlwl 2/2 Running 20 17h
hcm-content-tools-gqwxh 0/1 Completed 0 4d16h
hcm-costpolicy-b6f97c65-qhmw4 1/2 Running 73 17h
hcm-csa-75487f85bf-2wmlg 2/2 Running 0 18h
hcm-csa-collector-6965565f4d-pnqm8 2/2 Running 5 17h
hcm-elasticsearch-75f45bc7db-nqkfc 2/2 Running 0 17h
hcm-image-catalog-7b544bb886-97mqv 2/2 Running 7 17h
hcm-integration-gateway-c9d5f9c86-j75x6 2/2 Running 2 17h
hcm-itom-di-dp-master-dpl-6565b44b-chnkm 2/2 Running 0 17h
hcm-mpp-5467f84487-gr2p8 1/2 Running 123 4d16h
hcm-nginx-ingress-controller-94fxz 1/1 Running 2 4d17h
hcm-oo-585bffff9f-6qp4g 2/2 Running 5 4d16h
hcm-oodesigner-7cbfd8d9df-s4l2j 2/2 Running 0 17h
hcm-policy-gateway-6ff9cb8f45-kbdpg 0/2 Init:1/2 0 17h
hcm-scheduler-6cdcb46688-9cs6h 2/2 Running 0 17h
hcm-showback-7d5b76d688-nmmkl 2/2 Running 1 4d16h
hcm-showback-gateway-7b44f558c4-z2tfx 2/2 Running 6 17h
hcm-ucmdb-544cbf8cd8-nxhcp 2/2 Running 0 17h
hcm-ucmdb-browser-56fbbdb89d-t2ql4 2/2 Running 3 17h
hcm-ucmdb-probe-7795b5bfdd-vzlnm 0/2 Pending 0 17h
itom-di-administration-77c4dfc79-sfb8n 2/2 Running 2 17h
itom-di-dp-job-submitter-dpl-54d54f9798-bffzs 2/2 Running 0 17h
itom-di-dp-worker-dpl-c67997845-hvcl7 2/2 Running 2 17h
itom-di-receiver-dpl-db984c989-sfpx8 2/2 Running 0 17h
itom-di-vertica-ingestion-d547dbf86-ggwfp 1/2 Running 7 17h
itom-di-zk-dpl-0 1/1 Running 0 17h 

[root@hcm-master-1 ~]# /opt/kubernetes/bin/kube-status.sh

Server certificate expiration date: Sep 10 13:01:18 2024 GMT, 357 days left

Get Node IP addresses ...

Master servers: hcm-master-1.xxx.fr
Worker servers: hcm-master-1.xxx.fr hcm-worker-1.xxx.fr

Checking status on 10.X.Y.40
--------------------------------------
Local services status:
[DockerVersion] Docker:v19.03.5 ...................................... Running
[DockerStorageFree] docker ........................................... 11.191 GB
[KubeVersion] Client:v1.15.5 Server:v1.15.5 ......................... Running
[NativeService] docker ............................................... Running
[NativeService] kubelet .............................................. Running
[NativeService] kube-proxy ........................................... Running

Cluster services status:
[APIServer] API Server - hcm-master-1.xxx.fr:8443 ........ Running
[MngPortal] URL: hcm-master-1.xxx.fr:5443 ................ Running
[Node]
(Master) hcm-master-1.xxx.fr ................................ Running
(Worker) hcm-master-1.xxx.fr ................................ Running
(Worker) hcm-worker-1.xxx.fr ................................ Running
[Pod]
<hcm-master-1.xxx.fr>
(kube-system) apiserver-hcm-master-1.xxx.fr ................. Running
(kube-system) controller-hcm-master-1.xxx.fr ................ Running
(kube-system) scheduler-hcm-master-1.xxx.fr ................. Running
(kube-system) etcd-hcm-master-1.xxx.fr ...................... Running
<hcm-worker-1.xxx.fr>
[DaemonSet]
(kube-system) coredns ........................................... 1/1
(kube-system) kube-flannel-ds-amd64 ............................. 2/2
(core) kube-registry ............................................ 1/1
(core) fluentd .................................................. 2/2
(core) itom-logrotate ........................................... 2/2
[Deployment]
(kube-system) heapster-apiserver ................................ 1/1
(core) metrics-server ........................................... 1/1
(core) itom-cdf-tiller .......................................... 1/1
(core) itom-logrotate-deployment ................................ 1/1
(core) idm ...................................................... 2/2
(core) mng-portal ............................................... 1/1
(core) cdf-apiserver ............................................ 1/1
(core) suite-installer-frontend ................................. 1/1
(core) nginx-ingress-controller ................................. 2/2
(core) itom-cdf-ingress-frontend ................................ 2/2
[Service]
(default) kubernetes ............................................ Running
(kube-system) heapster .......................................... Running
(core) idm-svc .................................................. Running
(core) kube-dns ................................................. Running
(core) kube-registry ............................................ Running
(core) kubernetes-vault ......................................... Running
(core) mng-portal ............................................... Running
(core) suite-installer-svc ...................................... Running
(core) cdf-svc .................................................. Running
(core) cdf-suitefrontend-svc .................................... Running
(core) nginx-ingress-controller-svc ............................. Running
(core) itom-cdf-ingress-frontend-svc ............................ Running
(core) metrics-server ........................................... Running
[NFS]
<PersistentVolume: hcm-n0waq-db-backup-vol>
hcm-master-1.xxx.fr:/var/vols/itom/db-backup ................ Passed
<PersistentVolume: hcm-n0waq-hcm-vol-claim>
hcm-master-1.xxx.fr:/var/vols/itom/hcm ...................... Passed
<PersistentVolume: itom-logging>
hcm-master-1.xxx.fr:/var/vols/itom/logs ..................... Passed
<PersistentVolume: itom-vol>
10.X.Y.40:/var/vols/itom/core ............................... Passed
[DB] cdfidmdb ........................................................ Passed

Full CDF is Running

My question is how to know where to start: which component failed first? I know the command to investigate the error for a pod that is not running, for example:

# kubectl logs cdf-apiserver-7c4fbc8b7c-92fxg -n core -c cdf-apiserver

But I would like to know if there is a log file that describes a chronology of the events, like beginning to deploy the csa pod, then the mpp pod, then the oo pod, etc. Does this file exist somewhere? (I want to know the first pod that went into error during deployment so I can fix that error, as there may be dependencies between the pod installations.)

Also, I wanted to uninstall HCM in order to install it again from the management portal, but the login returns a "username/password" error, even though the credentials have not changed since the last time I tried to deploy HCM.
Is it possible to change the credentials of the management portal so that I can uninstall properly?

Thanks in advance for your help !

Regards,

Jean-Philippe


  • 0

    Hello Jean-Philippe,

    Thank you for this information and for opening a new discussion.

    I'm sorry that things are still not working after addressing the other issues.

    Let me help answer some of your questions so that you can move forward.

    There are instructions on how to reset the HCM Management Portal password.  Those instructions can be found at the following URL:

    Reset CDF management portal password
    https://docs.microfocus.com/doc/Hybrid_Cloud_Management/2020.05/ResetCDFloginpassword

    I would suggest using the second option of resetting from within the IdM Pod.

    Once you reset the password, you should be able to remove everything and attempt the installation again.

    There may be a log file within the /opt/kubernetes/log directory that shows the installation process.  However, it will not be very helpful, since many of the pods are not dependent on each other.
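
    That said, Kubernetes itself keeps a chronological event stream that can serve a similar purpose.  A sketch, using the hcm-n0waq namespace from your output (note that events are only retained for a limited time, so this will not cover the whole installation):

    # kubectl get events -n hcm-n0waq --sort-by=.metadata.creationTimestamp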

    Let me help provide you with some information and some commands that you can use to help you isolate the issue.

    Use the following command to return all of the pods that are not running normally.

    # kubectl get pods --all-namespaces -o wide | awk -F " *|/" '($3!=$4 || $5!="Running") && $5!="Completed" {print $0}'

    This should return only the pods, across all of the namespaces, that are having problems.  This gives you a shorter list to focus on.

    As you look at this list, if you see pods that are in either an 'init' or 'pending' state, you can ignore those for now.  This normally means that they are waiting for some other pod or function to be fully running before they come up fully.  They are dependent on another pod that is having problems.
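
    To see which init step a pod in an 'Init' state is stuck on, here is a quick jsonpath sketch (pod name and namespace are placeholders):

    # kubectl get pod <pod_name> -n <namespace> -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{": ready="}{.ready}{"\n"}{end}'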

    Looking at the pods that are in a 'Running' or 'ErrImagePull' state but not fully up would be a good starting point.

    First, I know that Cloud Optimizer is dependent on Vertica, so if the Vertica or associated pods are not running properly, then you would start with the Vertica pods.  These are the pods that start with 'itom-di-...' within the list.
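
    For example, to list just those pods (using the hcm-n0waq namespace from your output):

    # kubectl get pods -n hcm-n0waq | grep itom-di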

    Here are two commands (one of which you already know) to look at information for the pods.

    # kubectl describe pod <pod_name> -n <namespace> 

    # kubectl logs <pod_name> -n <namespace> -c <container>

    The first one will provide the internal configuration of the pod and what it is supposed to run.  It has an 'init' section that provides the list of containers that need to run prior to the pod starting up.  There is also a 'containers' section listing the containers that are part of the running environment.  When you see Running 1/2, it means the pod is still working on bringing up the 2nd container within that list.

    At the end of the describe output is the Events section, which will also provide some indication of the problem.

    You already know the second command.  Just remember to include the -c <container> part.  If you forget it, some pods will return a list of the possible containers; others will not, and will simply not show you the log you expect.  Again, the <container> name comes from the list within the describe output.
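
    If you want the container names without reading through the whole describe output, a jsonpath sketch (same placeholders as above):

    # kubectl get pod <pod_name> -n <namespace> -o jsonpath='{.spec.initContainers[*].name}'
    # kubectl get pod <pod_name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'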

    I hope that this information helps.

    Good luck.

    Regards,

    Mark

  • 0 in reply to 

    Hello Mark,

    Thank you for the time you spent helping me with this issue.

    First, I followed the second option on https://docs.microfocus.com/doc/Hybrid_Cloud_Management/2020.05/ResetCDFloginpassword

    The output of the commands confirms that the password for the CDF management portal was reset:

    # kubectl exec -it $(kubectl get pod -n core -ocustom-columns=NAME:.metadata.name |grep idm|head -1) -n core -c idm sh

    # sh /idmtools/idm-installer-tools/idm.sh databaseUser resetPassword -org Provider -name "admin" -plainPwd XXXXX
    INFO Password for admin is reset
    # sh /idmtools/idm-installer-tools/idm.sh databaseUser unlockUser -org Provider -name admin
    INFO User admin is unlocked successfully.

    So I tried to connect to the portal with this temporary password, but I keep getting the same HTTP/403 error (Authentication Failure).

    As I was unable to access the CDF management portal, I continued investigating by looking at the pods that are having problems.

    I have no pods in the "ErrImagePull" state, but some are running incompletely (0/2 or 1/2), as you explained.

    The ucmdb-probe pod has the status "ImagePullBackOff":

    hcm-n0waq     hcm-ucmdb-probe-7795b5bfdd-m6cm9                1/2     ImagePullBackOff   0          22h     172.16.34.143   hcm-worker-1.xxx.fr   <none>           <none>

    [root@hcm-master-1 ~]# kubectl describe pod hcm-ucmdb-probe-7795b5bfdd-m6cm9 -n hcm-n0waq
    Name: hcm-ucmdb-probe-7795b5bfdd-m6cm9
    Namespace: hcm-n0waq
    Priority: -1006000
    Priority Class Name: cdf-lowest-priority
    Node: hcm-worker-1.xxx.fr/10.X.Y.41
    Start Time: Tue, 19 Sep 2023 16:51:19 +0200
    Labels: app=hcm-ucmdb-probe-app
    pod-template-hash=7795b5bfdd
    Annotations: pod.boostport.com/vault-approle: hcm-n0waq-core
    pod.boostport.com/vault-init-container: install
    Status: Pending
    IP: 172.16.34.143
    Controlled By: ReplicaSet/hcm-ucmdb-probe-7795b5bfdd

    (...)

    At the end of the describe output, the Events section mentions the ImagePullBackOff error:

    (...)

    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Warning Failed 15m (x175 over 21h) kubelet, hcm-worker-1.xxx.fr Failed to pull image "localhost:5000/hpeswitom/itom-cmdb-probe:2020.02.118": rpc error: code = Unknown desc = context canceled
    Warning Failed 5m50s (x3868 over 21h) kubelet, hcm-worker-1.xxx.fr Error: ImagePullBackOff
    Normal BackOff 41s (x3881 over 21h) kubelet, hcm-worker-1.xxx.fr Back-off pulling image "localhost:5000/hpeswitom/itom-cmdb-probe:2020.02.118"
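
    As an extra check, I can try pulling the image manually on the worker node to see whether the local registry responds (using the exact image name from the events above):

    [root@hcm-worker-1 ~]# docker pull localhost:5000/hpeswitom/itom-cmdb-probe:2020.02.118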

    If I have a look at the mpp pod:

    hcm-n0waq     hcm-mpp-5467f84487-n9r6c                        1/2     Running            1

    [root@hcm-master-1 ~]# kubectl describe pod hcm-mpp-5467f84487-n9r6c -n hcm-n0waq
    Name: hcm-mpp-5467f84487-n9r6c
    Namespace: hcm-n0waq
    Priority: -1000000
    Priority Class Name: cdf-highest-priority
    Node: hcm-master-1.xxx.fr/10.X.Y.40
    Start Time: Tue, 19 Sep 2023 16:40:10 +0200
    Labels: app=hcm-mpp-app
    pod-template-hash=5467f84487
    Annotations: pod.boostport.com/vault-approle: hcm-n0waq-core
    pod.boostport.com/vault-init-container: install
    Status: Running
    IP: 172.16.49.56
    Controlled By: ReplicaSet/hcm-mpp-5467f84487
    Init Containers:
    install:
    Container ID: docker://1818b1bfd428edf9100e81ea7e5633354d193b2632178ecb1420eb30c5a50c04
    Image: localhost:5000/hpeswitom/kubernetes-vault-init:0.8.0-006
    Image ID: docker-pullable://localhost:5000/hpeswitom/kubernetes-vault-init@sha256:cc67ff15f8f46e8cf6243e1386bb3a93a15270d5d894ffb19b4836ffb03e8f41
    Port: <none>
    Host Port: <none>
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Wed, 20 Sep 2023 11:42:15 +0200
    Finished: Wed, 20 Sep 2023 11:42:43 +0200
    Ready: True
    Restart Count: 4
    Environment:
    VAULT_ROLE_ID: 4fb0aa94-82a5-a911-a7bd-2b2e7d510ef5
    CERT_COMMON_NAME: hcm-master-1.xxx.fr
    Mounts:
    /var/run/secrets/boostport.com from vault-token (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-jzspg (ro)
    waitfor-idm:
    Container ID: docker://17b582e2bafe9c3f340c661824c000bd180db9ce04e0277eeb48686d5ec5c719
    Image: localhost:5000/hpeswitom/itom-busybox:1.30.0-003
    Image ID: docker-pullable://localhost:5000/hpeswitom/itom-busybox@sha256:c8a39f236b193a174acc74280a36a2b3faca7cb3ea6bc708b63d8fd628950784
    Port: <none>
    Host Port: <none>
    Command:
    sh
    -c
    until nc -vz idm-svc.core 443 -w 5; do echo waiting for idm-svc.core; done;
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Wed, 20 Sep 2023 11:42:50 +0200
    Finished: Wed, 20 Sep 2023 11:50:22 +0200
    Ready: True
    Restart Count: 0
    Environment: <none>
    Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-jzspg (ro)
    waitfor-csa:
    Container ID: docker://617f83e9d6aaaa07ab0dfcba6a410fcc07a01efbf550bda6349158f5170f5748
    Image: localhost:5000/hpeswitom/itom-busybox:1.30.0-003
    Image ID: docker-pullable://localhost:5000/hpeswitom/itom-busybox@sha256:c8a39f236b193a174acc74280a36a2b3faca7cb3ea6bc708b63d8fd628950784
    Port: <none>
    Host Port: <none>
    Command:
    sh
    -c
    until nc -vz ${HCM_CSA_SVC_SERVICE_HOST} ${HCM_CSA_SVC_SERVICE_PORT} -w 5; do echo waiting for hcm-csa-svc; done;
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Wed, 20 Sep 2023 11:50:50 +0200
    Finished: Wed, 20 Sep 2023 13:00:57 +0200
    Ready: True
    Restart Count: 0
    Environment: <none>
    Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-jzspg (ro)

    And this is the log I get from the mpp pod:
    # kubectl logs hcm-mpp-5467f84487-n9r6c -n hcm-n0waq -c hcm-mpp
    Group 'mail' not found. Creating the user mailbox file with 0600 mode.

    # kubectl logs hcm-mpp-5467f84487-n9r6c -n hcm-n0waq -c install

    time="2023-09-20T09:42:15Z" level=info msg="pod IP:172.16.49.56"
    time="2023-09-20T09:42:15Z" level=warning msg="Failed to get pod approle, try again after 5 seconds."
    time="2023-09-20T09:42:20Z" level=warning msg="Failed to get pod approle, try again after 5 seconds."
    time="2023-09-20T09:42:25Z" level=warning msg="Failed to get pod approle, try again after 5 seconds."
    time="2023-09-20T09:42:30Z" level=warning msg="Failed to get pod approle, try again after 5 seconds."
    time="2023-09-20T09:42:40Z" level=info msg="role id:4fb0aa94-82a5-a911-a7bd-2b2e7d510ef5"
    time="2023-09-20T09:42:43Z" level=info msg="Successfully generate certificates. Exiting."
    time="2023-09-20T09:42:43Z" level=info msg="Successfully generate trustedCAs. Exiting."
    time="2023-09-20T09:42:43Z" level=info msg="Successfully generate issue_ca. Exiting."

    # kubectl logs hcm-mpp-5467f84487-n9r6c -n hcm-n0waq -c waitfor-idm
    waiting for idm-svc.core
    waiting for idm-svc.core

    (...)

    [root@hcm-master-1 ~]# kubectl logs hcm-mpp-5467f84487-n9r6c -n hcm-n0waq -c waitfor-csa |more
    waiting for hcm-csa-svc
    waiting for hcm-csa-svc

    (...)

    Not very interesting...

    Regards,

    Jean-Philippe

  • 0 in reply to 

    Update...
    I've been able to reset the admin account for this URL: https://hcm-master-1.xxx.fr:5443

    But it is not possible to log in to https://hcm-master-1.xxx.fr:3000

  • 0 in reply to 

    Hello Jean-Philippe,

    I apologize for not noticing the difference earlier.

    You can only access port 3000 before the HCM installation.  Afterwards, it will not be accessible since the suite is installed.

    First, confirm that all of the pods within the 'core' namespace are running properly.
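
    A quick check (anything not fully Ready or Completed needs attention):

    # kubectl get pods -n core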

    Once that is set, you can log in to the CDF Admin page (https://hcm-master-1.xxx.fr:5443) and uninstall the application.

    You should see the dots to the right of the 'hcm' entry, and among them there is an option to 'uninstall'.

    This will uninstall items within the 'hcm' namespace but not the core.

    Afterwards, you can install again and see if it works.

    Good luck.

    Regards,

    Mark

  • 0 in reply to 

    Hello Mark,
    No problem, now I know that port 3000 is not accessible after installation.
    I confirm that the core namespace is running properly.

    So I proceeded with uninstalling the application. I'll keep you informed about the result.

    Thanks and regards,

    Jean-Philippe


  • 0 in reply to 

    I finished the uninstallation of deployment hcm-n0waq from the portal. Then I proceeded with the following manual uninstallation steps:

    - removed the two NFS mount points on the master server:

    • /var/vols/itom/hcm
    • /var/vols/itom/core/suite-install/hcm/output

    - uninstalled the HCM schema from the Vertica server.

    - dropped the following databases on the external PostgreSQL DB:

    • csa, oo, oodesigner, ucmdb, autopass, ara

    Then I proceeded with a new installation but, as with the previous one, at the beginning of the Data Analytics Framework and CSA suite components I received a timeout (after around 1h29, perhaps even before):

    Deployment is not yet finished

    How can I get more information on the cause of this timeout?

    Now I can successfully connect to the csa and oo portals. I don't know the URL of ucmdb, but mpp is not accessible on TCP/8089.

    After connecting to HCM Portal / Workload / Pods, I found information about a failed readiness probe and a connection refused from pod hcm-mpp-69c5b54bc6-pxtc4, and also from pod hcm-oodesigner-5d46676cf7-nbbsj.

    I'm wondering if the timeout could come from the SSH connection between master and worker. Would it be a good idea to delete those pods to see if new ones come up running, as sketched below?
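
    If so, I suppose it would be something like this (namespace placeholder, pod names from the portal above), letting the ReplicaSet recreate the pods:

    # kubectl delete pod hcm-mpp-69c5b54bc6-pxtc4 -n <namespace>
    # kubectl delete pod hcm-oodesigner-5d46676cf7-nbbsj -n <namespace>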

    Best regards,

    Jean-Philippe.

  • 0 in reply to 

    Hello Jean-Philippe,

    I'm sorry to hear that you are still encountering problems.

    I believe the installation is timing out because it waits for all of the pods to come up before declaring itself complete.  Since not all of the pods are coming up, it times out.

    Your comment about communication between the master and worker nodes is important to note.

    The Readiness probe is simply a check to see if the application can communicate with the pod on the specified port, etc.
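
    If you want to see exactly what a given pod's readiness probe checks (port, path, timeouts), it is listed in the describe output; a quick sketch:

    # kubectl describe pod <pod_name> -n <namespace> | grep -i -A3 readiness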

    As a test, I would suggest that you disable the local firewalld on all of the nodes, if it is not already disabled.
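
    On CentOS 7 that would be something like:

    # systemctl stop firewalld
    # systemctl disable firewalld
    # systemctl status firewalld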

    I seem to remember you mentioning that you have all of the nodes already listed within /etc/hosts on all of the nodes.

    You can run the "ip r" command to list the routing tables on the various hosts to confirm that the various nodes are listed with the different 172.16.* addresses.  Each node will have its own 172.16.* subnet.  You may also see 172.17.* within the list for other things.

    If all of the above appears to be fine, then I would suggest looking closer at the problem pods.  Use describe and the logs to determine why each one is not coming up fully.

    You could check things like cluster resources and NFS disk space, just to rule out those possibilities.  Otherwise, confirm that the pod can communicate with the DBs and other dependencies.  You could also check the PostgreSQL and Vertica log files, depending on which pod we are talking about.
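
    For the resource and disk-space checks, a sketch (kubectl top should work since metrics-server is running in your cluster; the NFS path comes from your kube-status output):

    # kubectl top nodes
    # df -h /var/vols/itom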

    I hope that this is helpful.  Good luck.

    Regards,

    Mark

  • 0 in reply to 

    Hello Mark,

    Thank you very much for providing these new clues.

    I followed your advice and stopped and disabled the firewalld service on the nodes where it was still active.

    You are right: I checked and can confirm that all nodes have the same list of FQDNs of the other nodes in their /etc/hosts, and I successfully pinged the nodes to confirm this point.

    The routing tables look correct to me, as the results confirm the 172.16.* and 172.17.* networks on the master and worker:

    On master server

    default via 10.X.Y.1 dev eth0 proto static metric 100

    10.X.Y.0/24 dev eth0 proto kernel scope link src 10.X.Y.40 metric 100

    46.X.Y.52/30 dev eth1 proto kernel scope link src 46.X.Y.54 metric 101

    172.16.34.0/24 via 10.X.Y.41 dev eth0

    172.16.49.0/24 dev cni0 proto kernel scope link src 172.16.49.1

    172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1

    192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1

    On worker server

    default via 10.X.Y.1 dev eth0 proto static metric 100

    10.X.Y.0/24 dev eth0 proto kernel scope link src 10.X.Y.41 metric 100

    172.16.34.0/24 dev cni0 proto kernel scope link src 172.16.34.1

    172.16.49.0/24 via 10.X.Y.40 dev eth0

    172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1

    On vertica server

    default via 10.X.Y.1 dev eth0 proto static metric 100

    10.X.Y.0/24 dev eth0 proto kernel scope link src 10.X.Y.42 metric 100

    On postgres server

    default via 10.X.Y.1 dev eth0 proto static metric 100

    10.X.Y.0/24 dev eth0 proto kernel scope link src 10.X.Y.43 metric 100


    Now I'll focus on the logs of the different pods that are having problems, and I'll keep you informed of my results.

    Best regards,

    Jean-Philippe

  • 0 in reply to 

    Hello Mark,

    I launched a new installation, and this is the state after the deployment (still a timeout on some components on the installation page):

    [root@hcm-master-1 hcm-content-tools-wsjkz]# kubectl get pods -n hcm-1mkjh
    NAME READY STATUS RESTARTS AGE
    broker-0 2/2 Running 1 2d15h
    hcm-accounts-68c7796cf9-b4z6h 2/2 Running 0 2d15h
    hcm-autopass-c4f5ddc84-tqbwd 2/2 Running 0 2d15h
    hcm-cloudsearch-5f789b866d-kqsjs 1/2 Running 476 2d15h
    hcm-composer-6c856b7d55-95v6x 0/2 Pending 0 2d15h
    hcm-composer-gateway-5669bff697-4qb8v 0/2 Pending 0 2d15h
    hcm-content-tmv7w 0/1 Completed 0 2d15h
    hcm-content-tools-wsjkz 0/1 Completed 0 2d15h
    hcm-coso-config-data-7sqxd 0/1 Completed 0 2d15h
    hcm-csa-7498ff58f-6nw64 2/2 Running 0 2d15h
    hcm-csa-collector-5f5dbb8757-qq9pc 0/2 Init:2/3 0 2d15h
    hcm-elasticsearch-7667f9c765-jczwx 2/2 Running 0 2d15h
    hcm-idm-config-data-n8cpm 0/1 Completed 0 2d15h
    hcm-image-catalog-56bbd5f545-ddh48 0/2 Pending 0 2d15h
    hcm-integration-gateway-5bdb79bcdc-h6n9v 0/2 Pending 0 2d15h
    hcm-itom-di-dp-master-dpl-59978dcc59-q56z5 0/2 Pending 0 2d15h
    hcm-mpp-5d464d6464-mvhb7 1/2 Running 470 2d15h
    hcm-nginx-ingress-controller-mwwnq 1/1 Running 0 2d15h
    hcm-oo-dbb59ccb5-h45wn 2/2 Running 0 2d15h
    hcm-oodesigner-c7bdfb985-svgkx 0/2 Pending 0 2d15h
    hcm-scheduler-7b5949f7ff-5t69f 2/2 Running 0 2d15h
    hcm-showback-56947c9787-xbntg 0/2 Pending 0 2d15h
    hcm-showback-gateway-846999c5d9-xm26z 0/2 Pending 0 2d15h
    hcm-ucmdb-browser-b7748d965-8qblv 0/2 Pending 0 2d15h
    hcm-ucmdb-d79dbdf44-qzbm2 0/2 Pending 0 2d15h
    hcm-ucmdb-probe-69bd67f5d7-grvpk 0/2 Pending 0 2d15h
    itom-di-administration-74fb7578bd-8lpkh 0/2 Pending 0 2d15h
    itom-di-dp-job-submitter-dpl-c8f799d57-lrxkq 0/2 Pending 0 2d15h
    itom-di-dp-worker-dpl-566db79dd6-k5cn9 0/2 Pending 0 2d15h
    itom-di-receiver-dpl-585b6f7c5d-rgfvw 0/2 Pending 0 2d15h
    itom-di-vertica-ingestion-6cf4fcb5f7-g9vpg 0/2 Pending 0 2d15h
    itom-di-zk-dpl-0 1/1 Running 2 2d15h

    When I look at the logs of the csa pod, I get:

    # kubectl logs hcm-csa-7498ff58f-6nw64 hcm-csa -n hcm-1mkjh

    Group 'mail' not found. Creating the user mailbox file with 0600 mode.
    Certificate was added to keystore
    Certificate was added to keystore
    Certificate was added to keystore

    (...)

    find: '/usr/local/microfocus/csa/jboss-as/standalone/deployments/csa.war/administration/locales': No such file or directory

    (...)

    Entry for alias cert_2 successfully imported.
    Entry for alias cert_1 successfully imported.
    Import command completed: 7 entries successfully imported, 0 entries failed or cancelled

    Warning:
    Migrated "/usr/local/microfocus/csa/jboss-as/standalone/configuration/capsuletruststore" to Non JKS/JCEKS. The JKS keystore is backed up as "/usr/local/microfocus/csa/jboss-as/standalone/configuration/capsuletruststore.old".
    cp: -r not specified; omitting directory '/usr/local/microfocus/csa/encryption/old'
    Replaced 14 of 14 ENC(...) occurrences in /usr/local/microfocus/csa/jboss-as/standalone/deployments/csa.war/WEB-INF/classes/csa.properties.
    Replaced 1 of 1 ENC(...) occurrences in /usr/local/microfocus/csa/jboss-as/standalone/deployments/csa.war/WEB-INF/applicationContext-security.xml.
    Replaced 1 of 1 ENC(...) occurrences in /usr/local/microfocus/csa/tools/ucmdb-component-import-tool/UCMDBComponentImportTool/config.properties.
    Replaced 2 of 2 ENC(...) occurrences in config.properties.
    jq: error (at <stdin>:1): Cannot iterate over null (null)
    jq: error (at <stdin>:1): Cannot iterate over null (null)
    jq: error (at <stdin>:1): Cannot iterate over null (null)
    jq: error (at <stdin>:1): Cannot iterate over null (null)
    jq: error (at <stdin>:0): Cannot iterate over null (null)

    Best regards,

    Jean-Philippe HAAG