FAQ

Problem Diagnosis

  1. When filing an error report, please include the log file, which typically resides in a subfolder under the output folder of XGen (~/output inside XGen container and Output/ in XGen folder on the host) (see XGen Results).

  2. If you have any problems pulling docker images, make sure you have logged into the Docker repository and that you have the correct permissions. See installation instructions.

  3. If you get an error message about the GPU, ensure that Nvidia-Driver and Nvidia-Docker are correctly installed and that their versions meet the requirements. See installation instructions.

  4. If XGen lists no devices after it starts, check the connections and setup of the devices. See installation instructions. Run xgenctl status in the XGen, Agent, and Controller shells to check the status of the services. If a service has failed or exited, run xgenctl restart to restart it.

  5. If the following error appears, it is possible that your devices are disconnected. Please check the connections and setup of the devices. See installation instructions. Use xgen_devices list -nm to see what devices are connected in the agent container.

    ValueError: benchmark testing error, latency cannot be nan
    
  6. If you encounter the following error message, you have probably specified a training script path that does not contain train_script_main.py. See Custom AI.

    ModuleNotFoundError: No module named 'train_script_main'
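
Before starting XGen, a quick check along these lines can confirm that the training script path contains the expected entry file. This is a minimal sketch: the helper name `has_train_entry` is hypothetical, and it assumes the file must sit directly inside the directory you pass to XGen.

```python
from pathlib import Path

# File name taken from the error message above.
REQUIRED_ENTRY = "train_script_main.py"


def has_train_entry(script_dir: str) -> bool:
    """Return True if script_dir directly contains train_script_main.py."""
    return (Path(script_dir) / REQUIRED_ENTRY).is_file()
```

If `has_train_entry("/path/to/your/scripts")` returns `False`, XGen would likely fail with the `ModuleNotFoundError` shown above.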
    
  7. If you encounter the following error message, you might have run XGen in a directory that contains some PyTorch model files (*.pth). Try to run XGen in a different directory (e.g., ~/Projects/).

    Fatal Python error: initsite: Failed to import the site module
    
  8. If you encounter the following error message, run sudo apt install gnupg2 pass inside the XGen shell to fix the issue.

    Error saving credentials: error storing credentials - err: ...
    
  9. If you run the uninstall_xgen command and encounter the following error message,

    Error response from daemon: conflict: unable to delete foo (must be forced) - image is being used by stopped container bar
    
    try to run docker rm -f bar to remove the container first, then run uninstall_xgen again.

  10. Network or network topology errors can appear when you run run_xgen. For example, you may see an error like the following:

    docker: Error response from daemon: driver failed programming external connectivity on endpoint xgen_mq (de6f4cc60458d22e940a9f1a15abd98ae51a447e1e101fc72b1f8a10b99f8035):  (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 55672 -j DNAT --to-destination 172.18.0.2:5672 ! -i br-39207ec16e43: iptables: No chain/target/match by that name.
    (exit status 1)).
    
    or an error like this:

    pymongo.errors.ServerSelectionTimeoutError: xgen_mongodb:27017: timed out, Timeout: 30s, Topology Description: <TopologyDescription id: 654c4d2e01fe9ea16dbaa2bc, topology_type: Unknown, servers: [<ServerDescription ('xgen_mongodb', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('xgen_mongodb:27017: timed out')>]>
    
    These errors are most likely caused by your network firewall rules, which may be the result of an incorrect firewall service configuration.

    The simplest fix is to disable the firewall service, or to allow the required ports listed on the installation page in your firewall rules. You can also insert the rule into iptables manually, but manipulating iptables rules directly is not recommended. To disable the firewalld service, run the commands below:

    sudo systemctl stop firewalld # stop firewall service
    sudo systemctl disable firewalld # disable firewall service
    sudo systemctl mask --now firewalld # mask firewall service
    sudo systemctl status firewalld # check firewall service status, should be inactive and masked
    
    To disable the ufw service, run the commands below:

    sudo systemctl stop ufw # stop firewall service
    sudo systemctl disable ufw # disable firewall service
    sudo systemctl mask --now ufw # mask firewall service
    sudo systemctl status ufw # check firewall service status, should be inactive and masked
    

    Sometimes, local network rules may override Docker’s default network and bridge forwarding rules. For example, if bip and default-address-pools are not specified for Docker, the default network segments generated by Docker look like this:

    $ ifconfig
    
    br-6f017036e0bd: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
        inet6 fe80::42:ebff:fe6d:1837  prefixlen 64  scopeid 0x20<link>
        ether 02:42:eb:6d:18:37  txqueuelen 0  (Ethernet)
        RX packets 22601980  bytes 5655627456 (5.6 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 30346252  bytes 7482474509 (7.4 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    
    docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
            inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
            inet6 fe80::42:fbff:fe36:365a  prefixlen 64  scopeid 0x20<link>
            ether 02:42:fb:36:36:5a  txqueuelen 0  (Ethernet)
            RX packets 0  bytes 0 (0.0 B)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 23  bytes 2506 (2.5 KB)
            TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    ...
    
    If the firewall rules happen to cover the 172.18.0.0/16 or 172.17.0.0/16 network segment (the networks of the br-* and docker0 interfaces above), the connection errors shown earlier will also appear.

    For example, consider the following nft firewall rules:

    $ sudo nft list ruleset
    
    chain nat_PREROUTING_ZONES {
    ip saddr 10.0.0.0/8 goto nat_PRE_public
    ip saddr 172.16.0.0/12 goto nat_PRE_public
    ip saddr 152.14.0.0/16 goto nat_PRE_public
    iifname "docker0" goto nat_PRE_docker
    iifname "br-08bf99834a22" goto nat_PRE_docker
    goto nat_PRE_public
    }
    ...
    
    The rule ip saddr 172.16.0.0/12 goto nat_PRE_public covers all addresses from 172.16.0.0 to 172.31.255.255. Traffic from Docker's networks therefore matches this rule first and is never forwarded to nat_PRE_docker, so XGen displays a connection error message.


    There are two solutions: (1) contact the system administrator to adjust the firewall rules so that Docker’s default networks are allowed through; or (2) configure Docker to use network addresses that the firewall rules do not cover. For the second option, edit the /etc/docker/daemon.json file and add the following content:

    ```json
    {
        "runtimes": {
            "nvidia": {
                "args": [],
                "path": "nvidia-container-runtime"
            }
        },
        "bip": "192.168.249.1/24",
        "default-address-pools": [{ "base": "192.168.250.0/18", "size": 24 }]
    }
    ```
    
    The `bip` specified here is the address of `docker0`, and `default-address-pools` specifies the address ranges from which Docker allocates the other `br-*` bridge networks. For more details on these options, refer to the networking section of Docker’s official documentation.
    

    For other firewall services, please refer to their official documentation.

    Please do not manipulate iptables rules directly unless you know what you are doing.

    Please do not disable the firewall service if you are unsure about the security risks.

    Please do not run multiple firewall services simultaneously. They may conflict with or block each other's rules.
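
To see whether a firewall rule covers a Docker bridge address, as in the nft example in item 10, the check can be sketched with Python's standard `ipaddress` module. The helper name `covering_rules` is hypothetical. The CIDR list is copied from the `ip saddr` rules in the sample ruleset above, and the addresses are the two bridge addresses from the `ifconfig` output plus the `bip` from the sample daemon.json.

```python
import ipaddress

# Source-address ranges from the sample nft ruleset above.
RULES = ["10.0.0.0/8", "172.16.0.0/12", "152.14.0.0/16"]


def covering_rules(address, rule_cidrs):
    """Return the CIDR ranges in rule_cidrs that contain the given IP address."""
    addr = ipaddress.ip_address(address)
    return [cidr for cidr in rule_cidrs if addr in ipaddress.ip_network(cidr)]


if __name__ == "__main__":
    for address in ["172.17.0.1", "172.18.0.1", "192.168.249.1"]:
        print(address, covering_rules(address, RULES))
    # 172.17.0.1 ['172.16.0.0/12']
    # 172.18.0.1 ['172.16.0.0/12']
    # 192.168.249.1 []
```

An empty list for the reconfigured `bip` address suggests that none of these particular rules would redirect it. Your actual ruleset may contain other rules, so treat this only as a first check.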

  11. Error report associated with ADB version mismatch:

        The host ADB version doesn't match the ADB version of XGen, please upgrade or degrade your ADB version to \$adb_version_on_container.
        You can install matched ADB version by the following command:
        1. curl -s https://dl.google.com/android/repository/platform-tools_r33.0.3-linux.zip -o android-platform-tools.zip
        2. sudo unzip -o -q android-platform-tools.zip -d /usr/lib/android-sdk/
        3. sudo ln -sf /usr/lib/android-sdk/platform-tools/adb /usr/bin/adb
    
    This error mostly appears in the Device Lab Agent, which hosts the edge devices. It occurs because the ADB version on the host machine does not match the ADB version in the container. Because ADB uses a client/server architecture, the host and container versions must be the same. To check the ADB version, run adb version and make sure its output matches the ADB version in the container. If it does not, follow the instructions in the error message to install the matching version.
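
The version comparison described in item 11 can be sketched as follows. This is an illustration under assumptions, not part of XGen: the function names are hypothetical, and the parser assumes that `adb version` prints a line starting with `Version` followed by the platform-tools version, as recent platform-tools releases do. How you capture the container's `adb version` output depends on your setup, so the comparison simply takes both outputs as strings.

```python
import re
import subprocess


def parse_adb_version(output):
    """Extract the value of the 'Version ...' line from `adb version` output, or None."""
    match = re.search(r"^Version\s+(\S+)", output, flags=re.MULTILINE)
    return match.group(1) if match else None


def host_adb_version():
    """Run `adb version` on the host and return the parsed version string."""
    result = subprocess.run(
        ["adb", "version"], capture_output=True, text=True, check=True
    )
    return parse_adb_version(result.stdout)


def versions_match(host_output, container_output):
    """Return True only if both outputs report the same, non-empty version."""
    host = parse_adb_version(host_output)
    return host is not None and host == parse_adb_version(container_output)
```

For example, passing the text printed by `adb version` on the host and in the container to `versions_match` returns `False` when the two differ, which is the situation the error message above reports.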