Application Troubleshooting
1. Build Failure Troubleshooting
- Code Build Phase: In the code build phase, code packaging is performed first. If code packaging fails, the build task fails.
- Image Build Phase: If code packaging succeeds, it enters the image build phase. If image building fails, the build task fails.
- Complete: If image building succeeds, the build task is completed.
Common Issues in Code Build Phase (buildpack)
- Build logs stuck at
Start clone source code from git
source code acquisition phase- Please check if the code source is correct and if there are any permission issues.
- Please check if the network connection is normal.
- Build logs stuck at
make code package success, create build code job success
code build task startup phase- The code build task will start a Job task in the
rbd-system
namespace. Please check the status of this Job task.
$ kubectl get pod -n rbd-system
NAME READY STATUS RESTARTS AGE
ef9363dc8987cc5afa439e296e378622-20250317161307 1/1 Running 0 8s
...... - The code build task will start a Job task in the
- Other cases may be code issues. Please check the build logs in detail for troubleshooting.
Common Issues in Image Build Phase (buildkit)
- Build logs stuck at
code build success, create build code job success
image build task startup phase- The image build task will start a Job task in the
rbd-system
namespace. Please check the status of this Job task.
$ kubectl get pod -n rbd-system
NAME READY STATUS RESTARTS AGE
ef9363dc8987cc5afa439e296e378622-20250317161307-dockerfile 1/1 Running 0 8s
...... - The image build task will start a Job task in the
- Source code build shows error: failed to solve: goodrain.me/runner:latest-amd64. This usually occurs when unable to get the
runner
image from thegoodrain.me
image repository. Try manually pushing the image again:- Get the latest
runner
image
nerdctl pull registry.cn-hangzhou.aliyuncs.com/goodrain/runner:stable
- Push to the
goodrain.me
image repository
nerdctl tag registry.cn-hangzhou.aliyuncs.com/goodrain/runner:stable goodrain.me/runner:latest-amd64
nerdctl login goodrain.me -u admin -padmin1234 --insecure-registry
nerdctl push goodrain.me/runner:latest-amd64 --insecure-registry - Get the latest
BuildKit Source Code Build Configuration
By default, BuildKit is used as the source code build image packaging tool.
The BuildKit configuration file name defaults to goodrain-me
. If an image repository name is specified during installation, the configuration file name will be the image repository name, such as registry-cn-hangzhou-aliyuncs-com
.
DockerHub Image Acceleration
Dockerfile source code build references DockerHub images timeout. Modify BuildKit configuration for image acceleration.
apiVersion: v1
data:
buildkittoml: |-
debug = true
[registry."goodrain.me"]
http = false
insecure = true
+ [registry."docker.io"]
+ mirrors = ["docker.rainbond.cc"]
kind: ConfigMap
metadata:
name: goodrain.me
namespace: rbd-system
Source Code Build Error x509: certificate signed by unknown authority
When connecting to a private repository using HTTP protocol during installation, source code build fails with error x509: certificate signed by unknown authority
. Modify the BuildKit configuration file.
apiVersion: v1
data:
buildkittoml: |-
debug = true
[registry."goodrain.me"]
http = false
insecure = true
+ [registry."xxx.xxx.xxx.xxx:5000"]
+ http = true
+ insecure = true
kind: ConfigMap
metadata:
name: goodrain.me
namespace: rbd-system
Source Code Build Shows dial tcp look up goodrain.me on xxx:53: no such host
This is usually because the local /etc/hosts
does not automatically write the resolution for goodrain.me
. Re-write it using the following command:
kubectl delete pod -l name=rainbond-operator -n rbd-system
rainbond-operator will automatically restart and write the /etc/hosts
Job task.
2. Component Operation Troubleshooting
No Component Operation Log Information
Component logs are pushed through WebSocket
. If there are no log messages, go to Platform Management -> Cluster -> Edit and check if the WebSocket
communication address is correct and if local communication with this address is possible.
Component Abnormal State Troubleshooting
-
Scheduling: Component instances that remain in the Scheduling state, shown as orange-yellow squares, indicate that there are not enough resources in the cluster to run this instance. Specific resource shortage details can be found by clicking the orange-yellow square and checking the
Description
in the instance details page. For example:Instance Status: Scheduling
Reason: Unschedulable
Description: 0/1 nodes are available: 1 node(s) had desk pressure- From the
Description
, we can see that there is currently 1 host node in the cluster, but it is in an unavailable state because the node has disk pressure. After expanding disk space or cleaning up space based on the reason, this issue will automatically resolve. Common resource shortage types also include: insufficient CPU, insufficient memory.
- From the
-
Waiting to Start: Component instances remain in the waiting to start state. The Rainbond platform determines the startup order based on dependencies between components. If a component remains in the waiting to start state for a long time, it means some of its dependent components have not started normally. Switch to the application topology view to sort out component dependencies and ensure all dependent components are in normal running state.
-
Running Exception: The running exception state indicates that the instance cannot run normally. Click the red square to find prompts in the instance details page, focusing on the status of containers in the instance. Continue troubleshooting based on different statuses. Here are some common problem states:
- ImagePullBackOff: This state indicates that the current container's image cannot be pulled. Scroll down to the
Events
list for more detailed information. Ensure the corresponding image can be pulled. If you find that the image that cannot be pulled starts withgoodrain.me
, you can try building the component to solve the problem. - CrashLoopBackup: This state indicates that the current container itself failed to start or is experiencing runtime errors. Switch to the
Logs
page to view business log output and solve the problem. - OOMkilled: This state indicates that the memory allocated to the container is too small, or there is a memory leak in the business itself. The memory configuration entry for business containers is located on the
Scaling
page. The memory configuration entry for plugin containers is located on thePlugins
page.
- ImagePullBackOff: This state indicates that the current container's image cannot be pulled. Scroll down to the
3. Third-party Component Troubleshooting
Please follow these steps for third-party components:
- Open internal ports for third-party components
- Set health checks for third-party components
- Start/update third-party components
Until the third-party component status is Ready
, it can be used normally.
4. Application/Component HTTP External Access Issues
Error Code Explanation
Before troubleshooting, first understand the meaning of these error codes:
Error Code | Error Name | Meaning |
---|---|---|
502 | Bad Gateway | Gateway received invalid response from upstream server |
503 | Service Unavailable | Service currently unavailable (overloaded or under maintenance) |
504 | Gateway Timeout | Upstream server did not respond within specified time when gateway attempted to execute request |
502 Bad Gateway Error Troubleshooting
Possible Causes
- Backend Service Not Running Normally: Component in abnormal state or not started
- Port Configuration Error: Exposed port does not match actual service port
- Health Check Failure: Component cannot pass gateway health check
Troubleshooting Steps
-
Check Component Running Status
-
Verify Port Configuration
- Go to component details page → Ports, confirm internal and external service port configurations are correct
- Confirm the application in the container is actually listening on this port
# Enter Web terminal to check port listening status
netstat -nltp -
Check Service Logs
- View component operation logs for possible error messages
- Check in Platform Management → Logs → Gateway Logs
503 Service Unavailable Error Troubleshooting
Possible Causes
- Service Overload: Component resources insufficient or request volume too large
- Component Being Deployed/Updated: May be temporarily unavailable during rolling update
Troubleshooting Steps
- Check Component Resource Usage
- Check Component Ongoing Operations
- Check if there are ongoing deployment or update operations
- Check rolling update strategy configuration
504 Gateway Timeout Error Troubleshooting
Possible Causes
- Request Processing Time Too Long: Complex business logic or time-consuming data processing
- Upstream Service Timeout: Dependent services responding slowly
- Network Connection Issues: High network latency within cluster
Troubleshooting Steps
-
Check Component Timeout Settings
- Check component internal processing timeout settings
-
Locate Time-consuming Operations
- Check application logs for marked time-consuming operations
- Use performance analysis tools to locate bottlenecks
-
Troubleshoot Network Connection
- Check network latency
Gateway Log Analysis
Gateway log information is crucial for troubleshooting. Check in Platform Management -> Logs -> Gateway Logs.
Common error log patterns:
- For 502 errors, look for logs like
connection refused
orupstream unavailable
- For 503 errors, focus on
circuit breaking
orrate limited
related logs - For 504 errors, look for
timeout
related information
5. Unable to Upload Offline Packages, Software Packages, Jar, WAR, ZIP, etc.
This is usually caused by local browser failing to communicate with Rainbond WebSocket. You can modify the WebSocket
address in Platform Management -> Cluster -> Edit Cluster
.
6. Web Terminal Cannot Be Used
Web terminal cannot be used, usually because the WebSocket
address is configured incorrectly. You can modify the WebSocket
address in Platform Management -> Cluster -> Edit Cluster
.
7. Application/Component TCP External Access Issues
TCP external service actually uses K8S's NodePort service. If it cannot be accessed, please check the following points:
- Check if TCP port is listening on the server
- Check if
kube-proxy
service is running normally and if there are any error logs - For quick installation of Rainbond, only 10 TCP ports (30000~30010) are open by default. Add more TCP ports
- If exceeding K8S NodePort port range, please extend port range