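Installing TensorFlow 2.4.1 on Huawei Kunpeng 920: Complete Notes from Download to Install
My company asked me to run TensorFlow on aarch64. Prebuilt packages were really hard to find, so I had to compile it myself. The build was painful, with one problem after another.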
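Notes
- The whole installation runs inside Docker (CentOS 7), without conda, on Python 3.8 (I have also verified that it installs under Python 3.6).
- gcc 5.5 is used for the build.
- Every error I hit along the way is recorded in full. This post is a log; I don't recommend using it as a tutorial.
- The CentOS 7 in Docker is the stock CentOS 7 image from Docker Hub.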
Create the Docker container and prepare for installation
First create the Docker container
The command below creates a brand-new container from the official centos:7 image.
docker run -d -p 9222:22 --name=tensorflow-test -e "container=docker" --privileged=true --restart=always centos:7 /usr/sbin/init
Then enter the container:
docker exec -it 48097947e31b bash
The commands and their output:
[root@ecs-111 ~]# docker run -d -p 9222:22 --name=hanlp-tester -e "container=docker" --privileged=true --restart=always centos:7 /usr/sbin/init
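Install the required components
Next, connect to the container and install the required components with yum, including the ssh tools, since a freshly created centos7 container is completely bare.
yum install wget curl telnet make net-tools initscripts sudo su openssh-server openssh-clients openssl-devel openssl zlib-devel -y
After a short wait it finishes. Don't forget to change the root password and start openssh.
[root@48097947e31b download]# passwd root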
Compile and install Python 3.8
Build time: about 8 minutes
Download the Python 3.8 source and compile it
Python 3.8 is downloaded from https://www.python.org/ftp/python/3.8.7/Python-3.8.7.tgz
wget -c https://www.python.org/ftp/python/3.8.7/Python-3.8.7.tgz
Output:
[root@48097947e31b download]# https_proxy= wget -c https://www.python.org/ftp/python/3.8.7/Python-3.8.7.tgz
Extract and compile
Extract Python-3.8.7.tgz:
tar -zxvf Python-3.8.7.tgz
Install and enable gcc 7 (via scl; in practice almost everything later on was compiled with gcc 5.5...)
yum install centos-release-scl -y
yum install devtoolset-7 -y
scl enable devtoolset-7 bash
Verify the GCC version:
[root@48097947e31b Python-3.8.7]# gcc --version
gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Start compiling Python
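make -j6 runs six compile jobs in parallel to speed up the build, at the cost of much more memory and CPU. A machine with many cores but little memory may run out of memory.
./configure --with-ssl-default-suites=openssl --enable-optimizations
make -j6
make install
Installation succeeded. Typing python3 on the command line confirms that the build and install are done: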
[root@48097947e31b Python-3.8.7]# python3
Python 3.8.7 (default, Feb 5 2021, 05:34:43)
[GCC 7.3.1 20180303 (Red Hat 7.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
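The trade-off noted above for make -j6 (more parallel jobs need more memory) can be turned into a rough rule of thumb. The sketch below is only an illustration: the function name pick_jobs and the budget of 2048 MB per compile job are assumptions, not values measured during this build.

```shell
# pick_jobs CORES MEM_MB [MB_PER_JOB]
# Print a value for make -j that is limited by both CPU cores and memory.
# MB_PER_JOB defaults to 2048, an assumed budget rather than a measured one.
pick_jobs() {
    cores=$1
    mem_mb=$2
    per_job=${3:-2048}
    by_mem=$((mem_mb / per_job))
    # Always allow at least one job, even on very small machines.
    if [ "$by_mem" -lt 1 ]; then
        by_mem=1
    fi
    # Use whichever limit is smaller.
    if [ "$by_mem" -lt "$cores" ]; then
        echo "$by_mem"
    else
        echo "$cores"
    fi
}
```

For example, `make -j"$(pick_jobs "$(nproc)" 8192)"` would run no more than 4 jobs on a machine with 8192 MB to spare, however many cores it has.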
Prepare TensorFlow
The official reference is Build from source | TensorFlow.
Upgrade pip
From here on use pip3. To speed up pip downloads, first switch the index to the aliyun mirror; otherwise the wait is unbearable.
pip3 install -i https://mirrors.aliyun.com/pypi/simple/ pip -U
Output:
[root@48097947e31b Python-3.8.7]# pip3 install -i https://mirrors.aliyun.com/pypi/simple/ pip -U
Install the dependencies the official guide asks for
pip3 install -U --user pip numpy wheel
Console output:
[root@48097947e31b Python-3.8.7]# pip install -U --user pip numpy wheel
Download the TensorFlow source package first
Download TensorFlow 2.4.1.zip (67 MB) from TensorFlow's GitHub and extract it. I use the latest release rather than cloning master.
wget -c https://github.com/tensorflow/tensorflow/archive/v2.4.1.zip
unzip v2.4.1.zip
Console output:
[root@48097947e31b download]# wget -c https://github.com/tensorflow/tensorflow/archive/v2.4.1.zip
--2021-02-05 06:02:22-- https://github.com/tensorflow/tensorflow/archive/v2.4.1.zip
Connecting to 127.0.0.1:7890... connected.
Proxy request sent, awaiting response... 302 Found
Location: https://codeload.github.com/tensorflow/tensorflow/zip/v2.4.1 [following]
--2021-02-05 06:02:23-- https://codeload.github.com/tensorflow/tensorflow/zip/v2.4.1
Connecting to 127.0.0.1:7890... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: 'v2.4.1.zip'
[ <=> ] 16,283,488 1.99MB/s
2021-02-05 06:03:53 (765 KB/s) - 'v2.4.1.zip' saved [69346072]
[root@48097947e31b download]# unzip v2.4.1.zip
Archive: v2.4.1.zip
85c8b2a817f95a3e979ecd1ed95bff1dc1335cff
creating: tensorflow-2.4.1/
inflating: tensorflow-2.4.1/.bazelrc
extracting: tensorflow-2.4.1/.bazelversion
creating: tensorflow-2.4.1/.github/
creating: tensorflow-2.4.1/.github/ISSUE_TEMPLATE/
...
inflating: tensorflow-2.4.1/tools/tf_env_collect.sh
finishing deferred symbolic links:
tensorflow-2.4.1/.pylintrc -> tensorflow/tools/ci_build/pylintrc
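[root@48097947e31b download]#
The "Install Bazel" section of the official guide says that Bazelisk is a quick way to install a suitable Bazel:
Original text: To build TensorFlow, you will need to install Bazel. Bazelisk is an easy way to install Bazel and automatically downloads the correct Bazel version for TensorFlow. For ease of use, add Bazelisk as the bazel executable in your PATH.
I checked and found no Bazelisk binary for the aarch64 architecture, so I chose to compile Bazel by hand.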
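Compile Bazel manually
Build time, ignoring the failed attempts and not counting the gcc 5 build: about 3 minutes
Download a suitable Bazel version
Go to the "Install Bazel" section on the TensorFlow site and follow the instructions step by step.
Note this sentence on the page: Make sure to install a supported Bazel version: any version between _TF_MIN_BAZEL_VERSION and _TF_MAX_BAZEL_VERSION as specified in tensorflow/configure.py.
Then look at that file to see which versions it requires:
cat /root/download/tensorflow-2.4.1/configure.py
The minimum is 3.1.0 and the maximum is 3.99.0, so 3.1.0 will do.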
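Rather than reading through the whole of configure.py, the two bounds can be pulled out directly. This is a minimal sketch that assumes the file assigns each bound on a single line in the form NAME = 'x.y.z'; bazel_bound is a hypothetical helper name.

```shell
# bazel_bound FILE NAME
# Print the quoted value assigned to NAME in FILE (for example 3.1.0).
# Assumes a single-line assignment such as: NAME = 'x.y.z'
bazel_bound() {
    sed -n "s/^$2 *= *['\"]\([^'\"]*\)['\"].*/\1/p" "$1"
}
```

Running `bazel_bound /root/download/tensorflow-2.4.1/configure.py _TF_MIN_BAZEL_VERSION` should then print the minimum version read off above, and the same call with _TF_MAX_BAZEL_VERSION the maximum.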
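The file to download is bazel-3.1.0-dist.zip (257 MB), from Release 3.1.0 · bazelbuild/bazel.
Download and extract bazel-3.1.0-dist.zip
wget -c https://github.com/bazelbuild/bazel/releases/download/3.1.0/bazel-3.1.0-dist.zip
Compile Bazel
Install the Java dependency; Java 11 is chosen here.
yum install java-11-openjdk java-11-openjdk-devel -y
Run the build
A proxy is strongly recommended here, or download the files into the corresponding directories yourself; otherwise it is very slow.
cd bazel-dist
./compile.sh
Console output: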
[root@48097947e31b bazel-dist]# ./compile.sh
🍃 Building Bazel from scratch.. (this step takes about 3 minutes)
🍃 Building Bazel with Bazel.(this is where it starts to take a very long time)
DEBUG: /tmp/bazel_ZlOoVY5r/out/external/bazel_toolchains/rules/rbe_repo/version_check.bzl:59:9:
...
bazel_tools/tools/jdk/include/linux' resolves to 'external/bazel_tools/tools/jdk/include/linux' not below the relative path of its package 'src/main/java/com/google/devtools/build/lib/syntax'. This will be an error in the future
Analyzing: target //src:bazel_nojdk (275 packages loaded, 8010 targets configu\
red)
Fetching @remotejdk11_linux_aarch64; fetching 36s
Fetching https://mirror.bazel.build/...nux_aarch64.tar.gz; 67,991,108B 34s (depends on the network; 412 MB; file URL: https://mirror.bazel.build/openjdk/azul-zulu11.37.48-ca-jdk11.0.6/zulu11.37.48-ca-jdk11.0.6-linux_aarch64.tar.gz)
...
[300 / 1,450] 8 actions, 7 running (this step takes quite a while)
JavacBootstrap .../buildjar/libskylark-deps.jar [for host]; 14s local
@com_google_protobuf//:protobuf; 2s local
@com_google_protobuf//:protobuf; 1s local
@com_google_protobuf//:protobuf; 1s local
@com_google_protobuf//:protobuf; 0s local
@com_google_protobuf//:protobuf; 0s local
@com_google_protobuf//:protobuf; 0s local
[Scann] @com_google_protobuf//:protobuf
An error occurred:
ERROR: /tmp/bazel_ZlOoVY5r/out/external/bazel_tools/third_party/ijar/BUILD:72:1: Linking of rule '@bazel_tools//third_party/ijar:zipper' failed (Exit 1): gcc failed: error executing command
(cd /tmp/bazel_ZlOoVY5r/out/execroot/io_bazel && \
exec env - \
LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:/opt/rh/devtoolset-7/root/usr/lib64/dyninst:/opt/rh/devtoolset-7/root/usr/lib/dyninst:/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib \
PATH=/opt/rh/devtoolset-7/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin \
PWD=/proc/self/cwd \
/opt/rh/devtoolset-7/root/usr/bin/gcc @bazel-out/host/bin/external/bazel_tools/third_party/ijar/zipper-2.params)
Execution platform: //:default_host_platform
bazel-out/host/bin/external/bazel_tools/src/main/cpp/util/_objs/filesystem/path_posix.o:path_posix.cc:function blaze_util::SplitPath(std::string const&): error: undefined reference to 'std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&, unsigned long, std::allocator<char> const&)'
bazel-out/host/bin/external/bazel_tools/src/main/cpp/util/_objs/filesystem/path_posix.o:path_posix.cc:function blaze_util::SplitPath(std::string const&): error: undefined reference to 'std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&, unsigned long, std::allocator<char> const&)'
collect2: error: ld returned 1 exit status
Target //src:bazel_nojdk failed to build
INFO: Elapsed time: 365.761s, Critical Path: 54.37s
INFO: 659 processes: 659 local.
FAILED: Build did NOT complete successfully
After searching for a long time I could not find a fix for this, so I tried installing gcc 5.5 to see whether that helps (build time: about 25 minutes).
wget -c http://ftp.gnu.org/gnu/gcc/gcc-5.5.0/gcc-5.5.0.tar.gz
tar -zxvf gcc-5.5.0.tar.gz
cd gcc-5.5.0
yum install gmp-devel mpfr-devel libmpc-devel -y
./configure --prefix=/usr/local/gcc5.5
make -j
make install
Compile again:
PATH=/usr/local/gcc5.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin LD_LIBRARY_PATH=/usr/local/gcc5.5/lib64:/usr/local/gcc5.5/lib ./compile.sh
Console output:
[root@48097947e31b bazel-dist]# PATH=/usr/local/gcc5.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin LD_LIBRARY_PATH=/usr/local/gcc5.5/lib64:/usr/local/gcc5.5/lib ./compile.sh
🍃 Building Bazel from scratch......
🍃 Building Bazel with Bazel.
...
Analyzing: target //src:bazel_nojdk (275 packages loaded, 8010 targets configu\
red)
Fetching @remotejdk11_linux_aarch64; fetching 11s
Fetching https://mirror.bazel.build/...linux_aarch64.tar.gz; 8,101,840B 9s (412 MB; annoyingly, it is downloaded all over again; about 5 minutes at 1.5 MB/s)
[104 / 1,430] 8 actions running
JavacBootstrap .../buildjar/libskylark-deps.jar [for host]; 22s local
@com_google_protobuf//:protobuf; 7s local
@com_google_protobuf//:protobuf; 4s local
@com_google_protobuf//:protobuf_lite; 3s local
@com_google_protobuf//:protobuf; 2s local
@com_google_protobuf//:protobuf_lite; 0s local
@com_google_protobuf//:protobuf_lite; 0s local
@com_google_protobuf//:protobuf; 0s local
Another error:
ERROR: /root/download/bazel-dist/src/main/protobuf/BUILD:202:1: Action src/main/protobuf/command_server.grpc.pb.h failed (Exit 1): protoc failed: error executing command
(cd /tmp/bazel_fsjZQa4T/out/execroot/io_bazel && \
exec env - \
bazel-out/host/bin/external/com_google_protobuf/protoc '--plugin=protoc-gen-PLUGIN=bazel-out/host/bin/third_party/grpc/cpp_plugin' '--PLUGIN_out=bazel-out/aarch64-opt/bin' '--proto_path=.' '--proto_path=.' '--proto_path=bazel-out/aarch64-opt/bin/external/com_google_protobuf/_virtual_imports/descriptor_proto' '--proto_path=bazel-out/aarch64-opt/bin' src/main/protobuf/command_server.proto)
Execution platform: //:default_host_platform
/tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc)
/tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc)
/tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc)
Target //src:bazel_nojdk failed to build
ERROR: /root/download/bazel-dist/src/main/cpp/BUILD:98:1 Action src/main/protobuf/command_server.grpc.pb.h failed (Exit 1): protoc failed: error executing command
(cd /tmp/bazel_fsjZQa4T/out/execroot/io_bazel && \
exec env - \
bazel-out/host/bin/external/com_google_protobuf/protoc '--plugin=protoc-gen-PLUGIN=bazel-out/host/bin/third_party/grpc/cpp_plugin' '--PLUGIN_out=bazel-out/aarch64-opt/bin' '--proto_path=.' '--proto_path=.' '--proto_path=bazel-out/aarch64-opt/bin/external/com_google_protobuf/_virtual_imports/descriptor_proto' '--proto_path=bazel-out/aarch64-opt/bin' src/main/protobuf/command_server.proto)
Execution platform: //:default_host_platform
INFO: Elapsed time: 370.840s, Critical Path: 58.40s
INFO: 616 processes: 589 local, 27 worker.
FAILED: Build did NOT complete successfully
Good, this time the error points at something specific:
/tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc)
/tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc)
/tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /tmp/bazel_fsjZQa4T/out/execroot/io_bazel/bazel-out/host/bin/external/com_google_protobuf/protoc)
The cause is that /lib64/libstdc++.so.6 does not provide GLIBCXX_3.4.20. Check which GLIBCXX versions the freshly built /usr/local/gcc5.5/lib64/libstdc++.so.6 supports:
strings /usr/local/gcc5.5/lib64/libstdc++.so.6 | grep ^GLIBCXX
Console output:
[root@48097947e31b bazel-dist]# strings /usr/local/gcc5.5/lib64/libstdc++.so.6 | grep ^GLIBCXX
GLIBCXX_3.4
...
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_3.4.20 #### wanted version
GLIBCXX_3.4.21 #### wanted version
...
GLIBCXX_3.4.4
Next, check whether CXXABI_1.3.8 is present:
strings /usr/local/gcc5.5/lib64/libstdc++.so.6 | grep ^CXXABI
Console output:
[root@48097947e31b bazel-dist]# strings /usr/local/gcc5.5/lib64/libstdc++.so.6 | grep ^CXXABI
...
CXXABI_1.3.8 #### wanted version
...
CXXABI_1.3.3
So the gcc 5.5 build contains CXXABI_1.3.8 and GLIBCXX_3.4.20; linking that file into /lib64 should be enough:
unlink /lib64/libstdc++.so.6
ln -s /usr/local/gcc5.5/lib64/libstdc++.so.6 /lib64/libstdc++.so.6
Console output:
[root@48097947e31b bazel-dist]# ll /lib64/libstdc++.so.6
lrwxrwxrwx 1 root root 19 Nov 13 01:54 /lib64/libstdc++.so.6 -> libstdc++.so.6.0.19
[root@48097947e31b bazel-dist]# unlink /lib64/libstdc++.so.6
[root@48097947e31b bazel-dist]# ln -s /usr/local/gcc5.5/lib64/libstdc++.so.6 /lib64/libstdc++.so.6
[root@48097947e31b bazel-dist]# ll /lib64/libstdc++.so.6
lrwxrwxrwx 1 root root 38 Feb 5 08:29 /lib64/libstdc++.so.6 -> /usr/local/gcc5.5/lib64/libstdc++.so.6
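The two strings | grep checks above can be folded into one small helper, which is handy for confirming that whatever /lib64/libstdc++.so.6 now points at really carries the versions protoc asked for. This is a sketch only: has_symver and check_symvers are hypothetical names, and scanning the file with grep -a stands in for strings.

```shell
# has_symver LIB TAG
# Succeed if LIB contains the exact version tag TAG (e.g. GLIBCXX_3.4.20).
has_symver() {
    grep -a -o '[A-Z]*_[0-9][0-9.]*' "$1" | grep -F -x -q "$2"
}

# check_symvers LIB TAG...
# Report each TAG as ok or MISSING; return non-zero if any is missing.
check_symvers() {
    lib=$1
    shift
    rc=0
    for tag in "$@"; do
        if has_symver "$lib" "$tag"; then
            echo "ok      $tag"
        else
            echo "MISSING $tag"
            rc=1
        fi
    done
    return "$rc"
}
```

Here that would be `check_symvers /lib64/libstdc++.so.6 GLIBCXX_3.4.20 GLIBCXX_3.4.21 CXXABI_1.3.8`, using the three tags from the protoc error.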
Compile yet again. The command is the same as above, so I won't repeat it (the 412 MB file gets downloaded once more; I was tempted to set up a local HTTP server for it). Console output:
[root@48097947e31b bazel-dist]# PATH=/usr/local/gcc5.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin LD_LIBRARY_PATH=/usr/local/gcc5.5/lib64:/usr/local/gcc5.5/lib https_proxy=http://127.0.0.1:7890 http_proxy=http://127.0.0.1:7890 ./compile.sh
🍃 Building Bazel from scratch......
🍃 Building Bazel with Bazel.
....
Fetching https://mirror.bazel.build/...linux_aarch64.tar.gz; 8,101,840B 9s (412 MB, rest omitted)
INFO: Found 1 target...
[0 / 1,066] [Prepa] Writing file src/embedded_tools_nojdk.params
....
[323 / 1,483] 8 actions running
[326 / 1,483] 8 actions running
INFO: From JavacBootstrap src/java_tools/buildjar/java/com/google/devtools/build/buildjar/libskylark-deps.jar [for host]:
.....
DEBUG: /tmp/bazel_3S4oONKX/out/external/build_bazel_rules_nodejs/internal/common/check_bazel_version.bzl:49:9: Make sure that you are running at least Bazel 0.21.0.
Build successful! Binary is here: /root/download/bazel-dist/output/bazel
The build succeeded. Copy the bazel binary into /usr/local/bin:
cp /root/download/bazel-dist/output/bazel /usr/local/bin/
Console output:
[root@48097947e31b bazel-dist]# cp /root/download/bazel-dist/output/bazel /usr/local/bin/
[root@48097947e31b bazel-dist]# bazel --version
bazel 3.1.0- (@non-git)
[root@48097947e31b bazel-dist]#
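The long PATH=... LD_LIBRARY_PATH=... prefix typed before each ./compile.sh run could be wrapped in a small function. The sketch below is an assumption-laden convenience, not what was actually run: with_gcc55 and GCC55_PREFIX are made-up names, and unlike the commands above it prepends to the existing PATH instead of replacing it with a fixed list.

```shell
# Where gcc 5.5 was installed (the --prefix used above); override if different.
GCC55_PREFIX=${GCC55_PREFIX:-/usr/local/gcc5.5}

# with_gcc55 CMD [ARGS...]
# Run CMD with the gcc 5.5 bin and lib directories placed first in
# PATH and LD_LIBRARY_PATH, leaving the calling shell's environment untouched.
with_gcc55() {
    env "PATH=$GCC55_PREFIX/bin:$PATH" \
        "LD_LIBRARY_PATH=$GCC55_PREFIX/lib64:$GCC55_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
        "$@"
}
```

With this in place, the rebuild would be started as `with_gcc55 ./compile.sh`, and `with_gcc55 gcc --version` is a quick way to confirm which compiler gets picked up.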
Start compiling TensorFlow 2.4.1 (very time-consuming)
Build time, ignoring the failed attempts and not counting the gcc 5 build: about 2 hours 40 minutes
Configure the build
Run the following command and answer the prompts according to your needs.
./configure
My choices:
[root@48097947e31b tensorflow-2.4.1]# ./configure
Build the pip package
Install the git dependency
yum install git -y
Build with bazel build
The simplest command is
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
To make sure the right gcc is used, I put gcc 5.5 on the PATH so that this build also uses gcc 5, and added --local_ram_resources=2048 to keep memory from blowing up:
PATH=/usr/local/gcc5.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin LD_LIBRARY_PATH=/usr/local/gcc5.5/lib64:/usr/local/gcc5.5/lib bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --verbose_failures --local_ram_resources=2048
Console output:
[root@48097947e31b tensorflow-2.4.1]# PATH=/usr/local/gcc5.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin https_proxy=http://127.0.0.1:7890 http_proxy=http://127.0.0.1:7890 bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --verbose_failures --local_ram_resources=2048
...
open_source_build=true --java_toolchain=//third_party/toolchains/java:tf_java_toolchain --host_java_toolchain=//third_party/toolchains/java:tf_java_toolchain --define=tensorflow_enable_mlir_generated_gpu_kernels=0 --
...
Analyzing: target //tensorflow/tools/pip_package:build_pip_package (16 packages loaded, 439 targets configured)
Fetching @go_sdk; fetching 49s
Fetching @boringssl; fetching 49s
Fetching @llvm-project; fetching 49s
Fetching @aws; fetching 49s
Fetching https://storage.googleapis.com/mirror.tensorflow.org/github.com/google/boringssl/archive/80ca9f9f6ece29ab132cce4cf807a9465a18cfac.tar.gz; 15,210,540B 48s
Fetching https://dl.google.com/go/go1.12.5.linux-arm64.tar.gz; 13,590,432B 46s
Fetching https://github.com/aws/aws-sdk-cpp/archive/1.7.336.tar.gz; 11,919,054B 44s
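Fetching https://github.com/llvm/llvm-project/archive/f402e682d0ef5598eeffc9a21a691b03e602ff58.tar.gz; 12,222,637B 44s ... (16 fetches)
Errors may show up along the way. An error of type "Error downloading" means the network gave out, and running the command a few more times gets past it. (P.S. This download logic looks like an intern wrote it. Networks in China are slow to begin with, yet more than ten files are fetched at once, so timeouts become very likely. After a timeout there is no retry; it goes straight to the checksum, which naturally fails on an incomplete file. As a result, one failed file out of a dozen stops every download and raises an error, even while the others are still in progress. I also tried placing a file in the expected directory by hand, but the next run downloaded it again anyway.) For example: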
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted: Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/6dc75189e5d225c7bfaf488577a89e58/external/io_bazel_rules_go/go/private/sdk.bzl", line 50
_remote_sdk(ctx, <3 more arguments>)
File "/root/.cache/bazel/_bazel_root/6dc75189e5d225c7bfaf488577a89e58/external/io_bazel_rules_go/go/private/sdk.bzl", line 120, in _remote_sdk
ctx.download(url = urls, <2 more arguments>)
java.io.IOException: Error downloading [https://dl.google.com/go/go1.12.5.linux-arm64.tar.gz] to /root/.cache/bazel/_bazel_root/6dc75189e5d225c7bfaf488577a89e58/external/go_sdk/go_sdk.tar.gz: Checksum was aac5b83aa2838dcc69817d928b4921ad1db945da9bbc70dd3e55a48ad7259505 but wanted ff09f34935cd189a4912f3f308ec83e4683c309304144eae9cf60ebc552e7cd8
INFO: Elapsed time: 31.488s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (202 packages loaded, 3829 targets configured)
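Fetching @io_bazel_rules_docker; Cloning 251f6a68b439744094faff800cd029798edf9faa of https://github.com/bazelbuild/rules_docker.git 27s
Fuming, I ran the same command nearly ten times before every file finally downloaded and the compile started. This stage is very long, an estimated 1-2 hours. Running the build under screen or nohup is advisable, so that a dropped console does not kill it.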
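Re-running the same command by hand until the downloads get through can be automated with a small retry loop. The following is a generic sketch, not something Bazel provides: retry and RETRY_DELAY are hypothetical names, and the 5-second default pause is an arbitrary choice.

```shell
# retry MAX CMD [ARGS...]
# Run CMD up to MAX times, stopping at the first success.
# RETRY_DELAY (seconds between attempts) defaults to 5.
retry() {
    max=$1
    shift
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
}
```

The build above would then be started as `retry 10 bazel build ...` with the same arguments as before. Saving the function and that call in a script and launching it with nohup or inside screen would also cover the dropped-console case.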
INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (177 packages loaded, 26606 targets configured).
INFO: Found 1 target...
[103 / 556] 8 actions, 7 running # don't be fooled by this count: there are about 15,000 actions in total, and the number keeps growing
Compiling external/com_googlesource_code_re2/re2/set.cc [for host]; 1s local
Compiling external/com_googlesource_code_re2/re2/parse.cc [for host]; 0s local
Compiling external/com_googlesource_code_re2/re2/re2.cc [for host]; 0s local
Compiling external/com_google_absl/absl/time/duration.cc [for host]; 0s local
Compiling external/com_googlesource_code_re2/re2/compile.cc [for host]; 0s local
Compiling external/com_googlesource_code_re2/re2/bitstate.cc [for host]; 0s local
Compiling external/com_googlesource_code_re2/re2/onepass.cc [for host]; 0s local
[Scann] Compiling external/com_googlesource_code_re2/re2/dfa.cc [for host]
Halfway through the build, another error:
ERROR: /root/download/tensorflow-2.4.1/tensorflow/python/keras/api/BUILD:111:1: Executing genrule //tensorflow/python/keras/api:keras_python_api_gen failed (Exit 1): bash failed: error executing command
(cd /root/.cache/bazel/_bazel_root/6dc75189e5d225c7bfaf488577a89e58/execroot/org_tensorflow && \
exec env - \
PATH=/usr/local/gcc5.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin \
PYTHON_BIN_PATH=/usr/local/bin/python3 \
PYTHON_LIB_PATH=/usr/local/lib/python3.8/site-packages \
TF2_BEHAVIOR=1 \
TF_CONFIGURE_IOS=0 \
/bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen --apidir=bazel-out/aarch64-opt/bin/tensorflow/python/ke
ras/api --apiname=keras --apiversion=1 --loading=default --package=tensorflow.python,tensorflow.python.keras,tensorflow.python.keras.activations,tensorflow.python.keras.applications.densenet,tensorflow.python.keras.applications.
efficientnet,tensorflow.python.keras.applications.imagenet_utils,tensorflow.python.keras.applications.inception_resnet_v2,tensorflow.python.keras.applications.inception_v3,tensorflow.python.keras.applications.mobilenet,tensorflow
...
scikit_learn/__init__.py')
Execution platform: @local_execution_config_platform//:platform
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/6dc75189e5d225c7bfaf488577a89e58/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 26, in <module>
from tensorflow.python.tools.api.generator import doc_srcs
File "/root/.cache/bazel/_bazel_root/6dc75189e5d225c7bfaf488577a89e58/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 28, in <module>
import ctypes
File "/usr/local/lib/python3.8/ctypes/__init__.py", line 7, in <module>
from _ctypes import Union, Structure, Array
ModuleNotFoundError: No module named '_ctypes'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
ERROR: /root/download/tensorflow-2.4.1/tensorflow/python/tools/BUILD:82:1 Executing genrule //tensorflow/python/keras/api:keras_python_api_gen failed (Exit 1): bash failed: error executing command
(cd /root/.cache/bazel/_bazel_root/6dc75189e5d225c7bfaf488577a89e58/execroot/org_tensorflow && \
exec env - \
PATH=/usr/local/gcc5.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin \
PYTHON_BIN_PATH=/usr/local/bin/python3 \
PYTHON_LIB_PATH=/usr/local/lib/python3.8/site-packages \
TF2_BEHAVIOR=1 \
TF_CONFIGURE_IOS=0 \
/bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_keras_python_api_gen --apidir=bazel-out/aarch64-opt/bin/tensorflow/python/keras/api --apiname=keras --apiversion=1 --loading=default --package=tensorflow.python,tensorflow.python.keras,tensorflo
...
python/keras/api/keras/wrappers/__init__.py bazel-out/aarch64-opt/bin/tensorflow/python/keras/api/keras/wrappers/scikit_learn/__init__.py')
Execution platform: @local_execution_config_platform//:platform
INFO: Elapsed time: 9034.655s, Critical Path: 5345.41s
INFO: 13843 processes: 13843 local.
FAILED: Build did NOT complete successfully
The error is obvious: ModuleNotFoundError: No module named '_ctypes'. A search showed that libffi-devel needs to be installed, and then Python has to be compiled all over again. So:
yum install libffi-devel -y
cd ../Python-3.8.7
./configure --with-ssl-default-suites=openssl --enable-optimizations
make -j
make install
Once that is done, check whether Python can now run from _ctypes import Union, Structure, Array:
[root@48097947e31b Python-3.8.7]# python3
Python 3.8.7 (default, Feb 5 2021, 13:40:10)
[GCC 7.3.1 20180303 (Red Hat 7.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from _ctypes import Union, Structure, Array
>>>
OK, carry on compiling TensorFlow:
PATH=/usr/local/gcc5.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --verbose_failures --local_ram_resources=2048
After a long, long wait, the console finally printed:
INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (177 packages loaded, 26606 targets configured).
......
INFO: Found 1 target...
[15,493 / 15,513] 8 actions, 4 running
Executing genrule //tensorflow/python/keras/api:keras_python_api_gen_compat_v2; 7s local
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 9638.762s, Critical Path: 7750.25s
INFO: 14845 processes: 14845 local.
INFO: Build completed successfully, 14992 total actions
The build finally succeeded. Total time: 2 hours 40 minutes 39 seconds.
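The missing _ctypes module only surfaced about two and a half hours into the first build (9034 s), so a quick check of the interpreter beforehand might have saved a run. The sketch below is one way to do that. missing_modules is a hypothetical helper, and apart from _ctypes, which comes from the error above, the modules to test are a guess.

```shell
# missing_modules PYTHON MODULE...
# Print the MODULEs that PYTHON cannot import, space-separated.
# Returns success only when every module imports cleanly.
missing_modules() {
    py=$1
    shift
    missing=""
    for mod in "$@"; do
        if ! "$py" -c "import $mod" >/dev/null 2>&1; then
            missing="${missing:+$missing }$mod"
        fi
    done
    echo "$missing"
    [ -z "$missing" ]
}
```

For instance, `missing_modules python3 _ctypes _ssl zlib` prints an empty line and returns success when all three import, and lists the offenders otherwise.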
Generate the whl file and install TensorFlow
See the official guide: Build from source | TensorFlow - Build the package
Run the following command:
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
Console output:
[root@48097947e31b tensorflow-2.4.1]# ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
That concludes the TensorFlow build.
Install TensorFlow
pip3 install /tmp/tensorflow_pkg/tensorflow-2.4.1-cp38-cp38-linux_aarch64.whl
Then another error appeared:
[root@48097947e31b tensorflow-2.4.1]# pip3 install /tmp/tensorflow_pkg/tensorflow-2.4.1-cp38-cp38-linux_aarch64.whl
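The output shows the following errors, one from grpcio and one from h5py:
...
First fix the grpcio error. Since it cannot be installed automatically, download it from PyPI and install it by hand.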
wget -c https://files.pythonhosted.org/packages/0e/5f/eeb402746a65839acdec78b7e757635f5e446138cc1d68589dfa32cba593/grpcio-1.32.0.tar.gz
Installing grpcio returned an error:
[root@48097947e31b ~]# cd grpcio-1.32.0
After adding the gcc 5.5 installed earlier to the environment (/usr/local/gcc5.5/bin could equally be put on the PATH permanently), run python3 setup.py install again:
PATH=/usr/local/gcc5.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin python3 setup.py install
Console output:
Installed /usr/local/lib/python3.8/site-packages/grpcio-1.32.0-py3.8-linux-aarch64.egg
Next, try installing h5py (2.10.0 is chosen because that is the version TensorFlow downloads during installation):
wget -c https://files.pythonhosted.org/packages/5f/97/a58afbcf40e8abecededd9512978b4e4915374e5b80049af082f49cebe9a/h5py-2.10.0.tar.gz
Console output:
...
It looks like hdf5 has to be installed. I found the official site and downloaded hdf5-1.12.0.tar.gz. First, though, I put gcc 5.5 on the PATH for good, since prefixing PATH every time is too tiring.
echo 'export PATH=/usr/local/gcc5.5/bin:$PATH' >> /etc/profile
Console output:
[root@48097947e31b download]# echo 'export PATH=/usr/local/gcc5.5/bin:$PATH' >> /etc/profile
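One side effect of an unconditional export PATH=...:$PATH line is that the directory is prepended again each time the profile is sourced. A guarded variant is sketched below. prepend_path is a hypothetical helper, not part of the original setup.

```shell
# prepend_path DIR CURRENT
# Print CURRENT with DIR added at the front, unless DIR is already in it.
prepend_path() {
    case ":$2:" in
        *":$1:"*) echo "$2" ;;
        *)        echo "$1:$2" ;;
    esac
}
```

In a profile script this would be used as `export PATH="$(prepend_path /usr/local/gcc5.5/bin "$PATH")"`. Note that /etc/profile is only read by login shells, so a shell opened with docker exec ... bash has to source it (or be started as a login shell) before the change takes effect.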
Continue installing hdf5
wget -c https://hdf-wordpress-1.s3.amazonaws.com/wp-content/uploads/manual/HDF5/HDF5_1_12_0/source/hdf5-1.12.0.tar.gz # best to open the site above and confirm this URL first
Back in the h5py directory:
LD_LIBRARY_PATH=/usr/local/hdf5/lib C_INCLUDE_PATH=/usr/local/hdf5/include python3 setup.py install
An error...
In file included from /usr/local/hdf5/include/H5public.h:32:0,
According to the answers to "Why is HDF5 giving a "too few arguments" error here?", the hdf5 version seems to be too new, so I dropped down one version and downloaded hdf5-1.10.7.tar.gz from the site:
https_proxy=http://127.0.0.1:7890/ wget -c https://hdf-wordpress-1.s3.amazonaws.com/wp-content/uploads/manual/HDF5/HDF5_1_10_7/src/hdf5-1.10.7.tar.gz
Back in the h5py directory, install h5py again:
LD_LIBRARY_PATH=/usr/local/hdf5-1.10/lib C_INCLUDE_PATH=/usr/local/hdf5-1.10/include python3 setup.py install
Console error:
[root@48097947e31b h5py-2.10.0]# python3 setup.py install
In short, hdf5 and hdf5_hl cannot be found and LD_LIBRARY_PATH has no effect. The output above shows that only /opt/local/lib and /usr/local/lib are searched, so the only option is to link the two .so files into a directory that is searched; that should do it.
ln -s /usr/local/hdf5-1.10/lib/libhdf5.so /usr/local/lib/
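Linking both libraries can be done in one step, with a check that each source file really exists before a link is created. This is only a sketch: link_libs is a hypothetical helper, and libhdf5_hl.so is assumed to be the file name behind the hdf5_hl library mentioned above.

```shell
# link_libs SRC_DIR DST_DIR NAME...
# Symlink each NAME from SRC_DIR into DST_DIR, replacing stale links.
# Stops with an error as soon as a source file is missing.
link_libs() {
    src=$1
    dst=$2
    shift 2
    for name in "$@"; do
        if [ ! -e "$src/$name" ]; then
            echo "missing: $src/$name" >&2
            return 1
        fi
        ln -sfn "$src/$name" "$dst/$name"
    done
}
```

For the paths used here that would be `link_libs /usr/local/hdf5-1.10/lib /usr/local/lib libhdf5.so libhdf5_hl.so`.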
Run it once more:
python3 setup.py install
Console output:
...
It worked. Reinstall TensorFlow:
pip3 install /tmp/tensorflow_pkg/tensorflow-2.4.1-cp38-cp38-linux_aarch64.whl
Console output:
Installing collected packages: grpcio, google-auth-oauthlib, absl-py, wrapt, typing-extensions, termcolor, tensorflow-estimator, tensorboard, opt-einsum, google-pasta, gast, flatbuffers, astunparse, tensorflow
With that, the installation is complete.