FAQ
Problem Diagnosis
-
When filing an error report, please include the log file, which typically resides in a subfolder under the XGen output folder (`~/output` inside the XGen container and `Output/` in the XGen folder on the host). See XGen Results.
-
If you have any problems pulling docker images, make sure you have logged into the Docker repository and that you have the correct permissions. See installation instructions.
-
If you get an error message about the GPU, ensure that Nvidia-Driver and Nvidia-Docker are correctly installed and that their versions meet the requirements. See installation instructions.
-
If XGen lists no devices after it starts, please check the connections and setup of the devices. See installation instructions. Use `xgenctl status` in the XGen, Agent, and Controller shells to see the status of the services. If a service has failed or exited, try the command `xgenctl restart` to restart it.
-
If the following error appears, your devices may be disconnected:
`ValueError: benchmark testing error, latency cannot be nan`
Please check the connections and setup of the devices. See installation instructions. Use `xgen_devices list -nm` to see which devices are connected in the agent container.
-
If you encounter the following error message, you have probably specified a training script path that does not contain `train_script_main.py`. See Custom AI.
`ModuleNotFoundError: No module named 'train_script_main'`
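A quick pre-flight check can catch this before XGen is started. The sketch below assumes the training script path is a directory that should directly contain `train_script_main.py`; the helper name `has_train_entry` is hypothetical and is not part of XGen.

```python
from pathlib import Path

# File name taken from the error message above.
ENTRY_FILE = "train_script_main.py"


def has_train_entry(script_dir: str) -> bool:
    """Return True if script_dir directly contains train_script_main.py."""
    # Assumption: the entry file sits at the top level of the directory.
    return (Path(script_dir).expanduser() / ENTRY_FILE).is_file()
```

For example, `has_train_entry("~/Projects/my_training")` (an illustrative path) returns `False` when the file is missing, which is the situation that leads to the `ModuleNotFoundError` above.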
-
If you encounter the following error message, you might have run `XGen` in a directory that contains PyTorch model files (`*.pth`). Try running `XGen` in a different directory (e.g., `~/Projects/`).
`Fatal Python error: initsite: Failed to import the site module`
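To see whether the current directory is affected, you can list its `*.pth` files before starting XGen. This is a minimal sketch; the helper name `find_pth_files` is hypothetical, and it only inspects the top level of the given directory.

```python
from pathlib import Path
from typing import List


def find_pth_files(directory: str = ".") -> List[str]:
    """Return the names of the *.pth files located directly in directory."""
    folder = Path(directory).expanduser()
    return sorted(p.name for p in folder.glob("*.pth") if p.is_file())
```

If `find_pth_files()` returns a non-empty list, switch to another directory as suggested above.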
-
If you encounter the following error message, try running `sudo apt install gnupg2 pass` inside the XGen shell to fix the issue.
`Error saving credentials: error storing credentials - err: ...`
-
If you run the `uninstall_xgen` command and encounter the following error message:
`Error response from daemon: conflict: unable to delete foo (must be forced) - image is being used by stopped container bar`
run `docker rm -f bar` to remove the container first, then run `uninstall_xgen` again.
-
Network or network-topology errors may appear when you run `run_xgen`. If you see an error like the following:
`docker: Error response from daemon: driver failed programming external connectivity on endpoint xgen_mq (de6f4cc60458d22e940a9f1a15abd98ae51a447e1e101fc72b1f8a10b99f8035): (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 55672 -j DNAT --to-destination 172.18.0.2:5672 ! -i br-39207ec16e43: iptables: No chain/target/match by that name. (exit status 1)).`
or an error like this:
`pymongo.errors.ServerSelectionTimeoutError: xgen_mongodb:27017: timed out, Timeout: 30s, Topology Description: <TopologyDescription id: 654c4d2e01fe9ea16dbaa2bc, topology_type: Unknown, servers: [<ServerDescription ('xgen_mongodb', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('xgen_mongodb:27017: timed out')>]>`
the cause is most likely your network firewall rules, which may result from an incorrect firewall service configuration.
The simplest fix is to disable the firewall service, to allow the required ports listed on the installation page in your firewall rules, or to insert the rule into iptables manually (manipulating iptables rules directly is not recommended).
To stop the `firewalld` service, run the commands below:
sudo systemctl stop firewalld # stop firewall service
sudo systemctl disable firewalld # disable firewall service
sudo systemctl mask --now firewalld # mask firewall service
sudo systemctl status firewalld # check firewall service status, should be inactive and masked
To stop the `ufw` service, run the commands below:
sudo systemctl stop ufw # stop firewall service
sudo systemctl disable ufw # disable firewall service
sudo systemctl mask --now ufw # mask firewall service
sudo systemctl status ufw # check firewall service status, should be inactive and masked
Sometimes, local network rules may override Docker’s default network and bridge forwarding rules. For example, without specifying `bip` and `default-address-pools` for Docker, the default network segments generated by Docker look like this:
$ ifconfig
br-6f017036e0bd: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.18.0.1 netmask 255.255.0.0 broadcast 172.18.255.255
        inet6 fe80::42:ebff:fe6d:1837 prefixlen 64 scopeid 0x20<link>
        ether 02:42:eb:6d:18:37 txqueuelen 0 (Ethernet)
        RX packets 22601980 bytes 5655627456 (5.6 GB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 30346252 bytes 7482474509 (7.4 GB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
        inet6 fe80::42:fbff:fe36:365a prefixlen 64 scopeid 0x20<link>
        ether 02:42:fb:36:36:5a txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 23 bytes 2506 (2.5 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
...
If the firewall rules happen to cover the 172.18.0.1 or 172.17.0.1 network segment, the connection error messages above will also appear.
For example, consider the following firewall nft rules:
$ sudo nft list ruleset
chain nat_PREROUTING_ZONES {
        ip saddr 10.0.0.0/8 goto nat_PRE_public
        ip saddr 172.16.0.0/12 goto nat_PRE_public
        ip saddr 152.14.0.0/16 goto nat_PRE_public
        iifname "docker0" goto nat_PRE_docker
        iifname "br-08bf99834a22" goto nat_PRE_docker
        goto nat_PRE_public
}
...
The rule `ip saddr 172.16.0.0/12 goto nat_PRE_public` covers all addresses from 172.16.0.0 to 172.31.255.255, so Docker-related traffic is not forwarded to `nat_PRE_docker`, and XGen displays a connection error message.
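The coverage claim can be verified with Python's standard `ipaddress` module. The sketch below uses the source ranges from the nft rules and the bridge addresses from the ifconfig output shown earlier; the helper name `covering_ranges` is hypothetical, and on your own host you would substitute your actual rules and addresses.

```python
import ipaddress

# Source ranges matched by the firewall rules shown earlier.
FIREWALL_RANGES = ["10.0.0.0/8", "172.16.0.0/12", "152.14.0.0/16"]

# Bridge addresses taken from the ifconfig output shown earlier.
DOCKER_BRIDGES = {"docker0": "172.17.0.1", "br-6f017036e0bd": "172.18.0.1"}


def covering_ranges(address, ranges=FIREWALL_RANGES):
    """Return the CIDR ranges that contain the given IPv4 address."""
    ip = ipaddress.ip_address(address)
    return [r for r in ranges if ip in ipaddress.ip_network(r)]


# First and last address of the /12 range named in the rule.
wide = ipaddress.ip_network("172.16.0.0/12")
print(wide[0], "-", wide[-1])

for name, addr in DOCKER_BRIDGES.items():
    print(name, addr, covering_ranges(addr))
```

Both bridge addresses fall inside `172.16.0.0/12`, which is why the `nat_PRE_public` rule matches before the Docker rules are reached.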
There are two solutions:
1. Contact the system administrator to adjust the firewall rules so that Docker’s default network is allowed through.
2. Configure the network addresses used by Docker so that they avoid the firewall rules.
For the second option, edit the `/etc/docker/daemon.json` file and add the following content:
```json
{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "bip": "192.168.249.1/24",
  "default-address-pools": [
    {
      "base": "192.168.250.0/18",
      "size": 24
    }
  ]
}
```
The `bip` specified here is the address of `docker0`, and `default-address-pools` specifies the address range from which Docker creates other `br` bridges. For more details on these settings and their meaning, refer to the networking section of Docker’s official documentation.
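Before restarting Docker, it can help to confirm that the chosen addresses really avoid the firewall ranges. The following is a minimal sketch using Python's standard library; the helper names `docker_networks` and `conflicts` are hypothetical, the firewall ranges are the ones from the nft example, and the JSON is a trimmed copy of the settings above.

```python
import ipaddress
import json

# Trimmed copy of the daemon.json network settings shown above.
DAEMON_JSON = """
{
  "bip": "192.168.249.1/24",
  "default-address-pools": [{"base": "192.168.250.0/18", "size": 24}]
}
"""

# Source ranges from the example nft rules; replace with your own.
FIREWALL_RANGES = ["10.0.0.0/8", "172.16.0.0/12", "152.14.0.0/16"]


def docker_networks(config):
    """Collect the bridge network and the address pools from a daemon.json dict."""
    nets = []
    if "bip" in config:
        # bip is a host address plus prefix, so take the enclosing network.
        nets.append(ipaddress.ip_interface(config["bip"]).network)
    for pool in config.get("default-address-pools", []):
        # strict=False accepts a base that has host bits set.
        nets.append(ipaddress.ip_network(pool["base"], strict=False))
    return nets


def conflicts(config, firewall_ranges=FIREWALL_RANGES):
    """Return (docker_network, firewall_range) pairs that overlap."""
    blocked = [ipaddress.ip_network(r) for r in firewall_ranges]
    return [
        (str(net), str(rng))
        for net in docker_networks(config)
        for rng in blocked
        if net.overlaps(rng)
    ]


config = json.loads(DAEMON_JSON)
print(docker_networks(config))
print(conflicts(config))
```

An empty list from `conflicts` means none of the configured networks overlaps the listed firewall ranges. Note that Python normalizes the base `192.168.250.0/18` to `192.168.192.0/18`, a range that also contains the `bip` network; how Docker itself treats that base is not verified here, so check the Docker documentation if in doubt.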
For other firewall services, please refer to their official documentation.
Please do not manipulate iptables rules directly unless you know what you are doing.
Please do not disable the firewall service if you are not sure about the security risk.
Please do not run multiple firewall services simultaneously. They may conflict with or block each other's rules.
-
Error report associated with an ADB version mismatch:
The host ADB version doesn't match the ADB version of XGen, please upgrade or degrade your ADB version to $adb_version_on_container. You can install matched ADB version by the following command:
1. curl -s https://dl.google.com/android/repository/platform-tools_r33.0.3-linux.zip -o android-platform-tools.zip
2. sudo unzip -o -q android-platform-tools.zip -d /usr/lib/android-sdk/
3. sudo ln -sf /usr/lib/android-sdk/platform-tools/adb /usr/bin/adb
This error message mostly appears in the Device Lab Agent, which hosts the edge devices. It occurs because the ADB version on the host machine is not compatible with the ADB version in the container. Because ADB uses a client/server (C/S) architecture, the versions on the host and in the container must be the same. To check the ADB version, run `adb version`. Please ensure that the output of this command matches the ADB version in the container. If it does not, follow the instructions in the error message to install the matching ADB version.
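If you want to compare the two versions programmatically, the sketch below parses the text printed by `adb version`. It assumes the output contains a line beginning with `Version` followed by the platform-tools version; the sample text, including its build suffix and install path, is illustrative only, and the helper names are hypothetical.

```python
import re
from typing import Optional

# Illustrative text in the usual `adb version` layout (assumed, not captured from XGen).
SAMPLE_OUTPUT = """Android Debug Bridge version 1.0.41
Version 33.0.3-8952118
Installed as /usr/bin/adb
"""


def platform_tools_version(adb_output: str) -> Optional[str]:
    """Extract the platform-tools version (e.g. '33.0.3') from `adb version` text."""
    match = re.search(r"^Version\s+(\d+(?:\.\d+)*)", adb_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def versions_match(host_output: str, container_output: str) -> bool:
    """Return True only when both outputs report the same, parseable version."""
    host = platform_tools_version(host_output)
    container = platform_tools_version(container_output)
    return host is not None and host == container
```

Run `adb version` once on the host and once in the container, then pass the two outputs to `versions_match`. Treating an unparseable output as a mismatch is a deliberate choice in this sketch, so that a missing or broken ADB installation is not reported as compatible.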