
[DolphinScheduler] Building a DolphinScheduler Cluster with Docker and Integrating It with a Secure CDH Cluster

1 Purpose of This Document

Apache DolphinScheduler (hereafter DS) is a distributed, decentralized, easily extensible visual DAG workflow scheduling platform. It aims to untangle the complex dependencies in data processing pipelines and make the scheduling system usable out of the box. This document describes how to deploy a DolphinScheduler cluster in a secure CDH cluster environment.


High reliability
A decentralized architecture in which Master and Worker nodes are deployed as peers avoids single points of failure and performance bottlenecks. A built-in task buffer queue absorbs bursts of concurrent tasks and keeps the system stable.

Ease of use
A visual DAG editor supports drag-and-drop workflow definition and real-time monitoring of task status. External systems can integrate through the API, enabling one-click deployment and automated scheduling.

Rich usage scenarios
Supports multi-tenant isolation and pausing/resuming tasks, integrates deeply with the big data ecosystem, and ships with close to 20 built-in task types, including Spark, Hive, MapReduce, Python, Shell, and SubProcess, covering a wide range of data processing needs.

High scalability
The scheduler uses a distributed architecture with good horizontal scalability; Master and Worker nodes can be scaled out or in dynamically. Custom task plugins are supported to meet the scheduling needs of diverse business scenarios.

2 Deployment Environment

1. CM and CDH version: 6.3.2

2. DolphinScheduler version: 3.1.9

3. ZooKeeper version: 3.7.1

This deployment uses three nodes to build a highly available DolphinScheduler cluster. The nodes and role assignment are as follows:

IP            Hostname       Roles
172.10.0.2    cm.hadoop      Master / Worker / API / Alert / ZooKeeper
172.10.0.3    cdh01.hadoop   Master / Worker / API
172.10.0.4    cdh02.hadoop   Master / Worker

3 Prerequisites

Install ZooKeeper

The ZooKeeper that ships with the CDH cluster is a CDH-specific 3.4.5 build, which does not meet DolphinScheduler 3.1.9's minimum requirement (3.5 or later), so a separate ZooKeeper must be installed.

1. Download:

wget https://downloads.apache.org/zookeeper/zookeeper-3.7.1/apache-zookeeper-3.7.1-bin.tar.gz

2. Quick start (standalone):

tar -zxvf apache-zookeeper-3.7.1-bin.tar.gz
cd apache-zookeeper-3.7.1-bin
mkdir data
echo "tickTime=2000
dataDir=./data
clientPort=2181
initLimit=5
syncLimit=2" > conf/zoo.cfgbin/zkServer.sh start

Then configure DolphinScheduler to point to this ZooKeeper instance.
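For DolphinScheduler 3.1.x this is done through the registry settings in bin/env/dolphinscheduler_env.sh (revisited in section 4.2). A minimal sketch, assuming the standalone ZooKeeper started above runs on the cm.hadoop node from the table in section 2; point the connect string at your actual ZooKeeper address:

# bin/env/dolphinscheduler_env.sh -- registry section
# The ZooKeeper address is an assumption based on the node table above.
export REGISTRY_TYPE=${REGISTRY_TYPE:-zookeeper}
export REGISTRY_ZOOKEEPER_CONNECT_STRING=${REGISTRY_ZOOKEEPER_CONNECT_STRING:-cm.hadoop:2181}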

4 Install DolphinScheduler

4.1 Environment Preparation

1. Get the DolphinScheduler installation package; the download page is:

https://dolphinscheduler.apache.org/zh-cn/download/download.html

Upload the downloaded package to the /root directory on the cm node.

2. Run the following commands on all three CDH servers to configure the Java environment variables:

sudo bash -c 'cat >> /etc/profile <<EOF
export JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera/
export PATH=\$JAVA_HOME/bin:\$PATH
export CLASSPATH=\$JAVA_HOME/lib/dt.jar:\$JAVA_HOME/lib/tools.jar:\$CLASSPATH
EOF'
source /etc/profile

3. Make sure the psmisc package is installed on every node in the cluster; install it with:

yum -y install psmisc

4. Create the DS deployment user and set up the hostname mapping on all cluster nodes by configuring the /etc/hosts file on every node, as shown below.
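Based on the node table in section 2, the /etc/hosts entries would look roughly like the following; the short aliases are an assumption, chosen to match the host names used later in this article (scp to cdh01/cdh02, hdfs://cm:8020). Adjust to your environment:

cat >> /etc/hosts <<EOF
172.10.0.2 cm.hadoop cm
172.10.0.3 cdh01.hadoop cdh01
172.10.0.4 cdh02.hadoop cdh02
EOF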

5. On every node in the cluster, run the following commands as root to add a dolphin user with the password dolphin123 and grant it passwordless sudo:

useradd dolphin
echo "dolphin123" | passwd --stdin dolphin
echo 'dolphin  ALL=(ALL)  NOPASSWD: ALL' >> /etc/sudoers
sed -i 's/Defaults    requiretty/#Defaults    requiretty/g' /etc/sudoers

Run the following commands on all nodes to switch to the dolphin user and verify that passwordless sudo works:

su - dolphin
sudo ls /root/

Note: the deployment user dolphin must be created and granted passwordless sudo here; otherwise the deployment user cannot switch to other Linux users via sudo -u, which is how multi-tenant job execution is implemented.
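A quick way to confirm that this impersonation works is to switch to another local account from the dolphin user; "hdfs" below is only an example of an account that already exists on CDH nodes:

su - dolphin -c 'sudo -u hdfs whoami'   # should print: hdfs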

6. Choose cm.hadoop as the primary node and configure passwordless SSH from the cm node to itself and to the cdh01 and cdh02 nodes (done here as the dolphin user).

On every node in the cluster, run the following commands to generate a key pair for the dolphin user (just press Enter at every prompt):

su - dolphin
ssh-keygen -t rsa

Append the dolphin user's public key under /home/dolphin/.ssh to the authorized_keys file and set the file permissions to 600:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ll ~/.ssh/authorized_keys 

Test whether passwordless login on the cm node itself works:

ssh localhost


Copy the authorized_keys file generated on the cm node to the /home/dolphin/.ssh directory on the other nodes (the password is dolphin123):

scp ~/.ssh/authorized_keys dolphin@cdh01:/home/dolphin/.ssh/
scp ~/.ssh/authorized_keys dolphin@cdh02:/home/dolphin/.ssh/

After that, verify passwordless login from the cm node to the other nodes; if no password is requested, the configuration succeeded.

7. Copy the DolphinScheduler package uploaded to the cm node into the /home/dolphin directory and extract it.

Copy the package from the Docker host into the cm container:

docker cp apache-dolphinscheduler-3.1.9-bin.tar.gz cm.hadoop:/home/dolphin/
cd /home/dolphin
tar -zxvf apache-dolphinscheduler-3.1.9-bin.tar.gz
chown -R dolphin. apache-dolphinscheduler-3.1.9-bin

4.2 Configure dolphinscheduler_env.sh

Edit the ./bin/env/dolphinscheduler_env.sh file to configure DolphinScheduler's environment variables.

1. MySQL configuration:

vim ./bin/env/dolphinscheduler_env.sh

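As a sketch, the database section of dolphinscheduler_env.sh in DolphinScheduler 3.1.x looks like the following, filled in here with the database, user, and password created in section 4.4; the host name "mysql" and the credentials are specific to this environment, so adjust them as needed:

export DATABASE=${DATABASE:-mysql}
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_URL="jdbc:mysql://mysql:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8&useSSL=false"
export SPRING_DATASOURCE_USERNAME=dolphins
export SPRING_DATASOURCE_PASSWORD="123456Aa."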

2. JDK configuration:

Set JAVA_HOME:
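A sketch, pointing at the Cloudera-bundled JDK configured in /etc/profile earlier:

export JAVA_HOME=${JAVA_HOME:-/usr/java/jdk1.8.0_181-cloudera}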

3. Hadoop-related configuration:

Because the Hadoop services are provided by CDH, everything lives under /opt/cloudera/parcels/CDH/lib. The configuration is as follows:

export HADOOP_HOME=${HADOOP_HOME:-/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/etc/hadoop}
export SPARK_HOME1=${SPARK_HOME1:-/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark}
export SPARK_HOME2=${SPARK_HOME2:-/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark}
export PYTHON_HOME=${PYTHON_HOME:-/opt/soft/python}
export HIVE_HOME=${HIVE_HOME:-/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hive}
export FLINK_HOME=${FLINK_HOME:-/opt/soft/flink}
export DATAX_HOME=${DATAX_HOME:-/opt/software/datax/datax}
export SEATUNNEL_HOME=${SEATUNNEL_HOME:-/opt/soft/seatunnel}
export CHUNJUN_HOME=${CHUNJUN_HOME:-/opt/soft/chunjun}


4.3 Configure common.properties

Edit the common.properties file under the api-server/conf/ and worker-server/conf/ directories (edit the api-server copy first; it is copied to worker-server at the end of this section). This file configures resource-related parameters, such as uploading DolphinScheduler resources to HDFS. The content is as follows:

# user data local directory path, please make sure the directory exists and have read write permissions
data.basedir.path=/tmp/dolphinscheduler

# resource view suffixs
#resource.view.suffixs=txt,log,sh,bat,conf,cfg,py,java,sql,xml,hql,properties,json,yml,yaml,ini,js

# resource storage type: HDFS, S3, OSS, NONE
resource.storage.type=HDFS
# resource store on HDFS/S3 path, resource file will store to this base path, self configuration, please make sure the directory exists on hdfs and have read write permissions. "/dolphinscheduler" is recommended
resource.storage.upload.base.path=/dolphinscheduler

# The AWS access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.access.key.id=minioadmin
# The AWS secret access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.secret.access.key=minioadmin123
# The AWS Region to use. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.region=cn-north-1
# The name of the bucket. You need to create them by yourself. Otherwise, the system cannot start. All buckets in Amazon S3 share a single namespace; ensure the bucket is given a unique name.
resource.aws.s3.bucket.name=dolphinscheduler
# You need to set this parameter when private cloud s3. If S3 uses public cloud, you only need to set resource.aws.region or set to the endpoint of a public cloud such as S3.cn-north-1.amazonaws.com.cn
resource.aws.s3.endpoint=http://192.168.1.39:10020

# alibaba cloud access key id, required if you set resource.storage.type=OSS
resource.alibaba.cloud.access.key.id=<your-access-key-id>
# alibaba cloud access key secret, required if you set resource.storage.type=OSS
resource.alibaba.cloud.access.key.secret=<your-access-key-secret>
# alibaba cloud region, required if you set resource.storage.type=OSS
resource.alibaba.cloud.region=cn-hangzhou
# oss bucket name, required if you set resource.storage.type=OSS
resource.alibaba.cloud.oss.bucket.name=dolphinscheduler
# oss bucket endpoint, required if you set resource.storage.type=OSS
resource.alibaba.cloud.oss.endpoint=https://oss-cn-hangzhou.aliyuncs.com

# if resource.storage.type=HDFS, the user must have the permission to create directories under the HDFS root path
resource.hdfs.root.user=hdfs
# if resource.storage.type=S3, the value like: s3a://dolphinscheduler; if resource.storage.type=HDFS and namenode HA is enabled, you need to copy core-site.xml and hdfs-site.xml to conf dir
resource.hdfs.fs.defaultFS=hdfs://cm:8020

# whether to startup kerberos
hadoop.security.authentication.startup.state=false
# java.security.krb5.conf path
java.security.krb5.conf.path=/opt/krb5.conf
# login user from keytab username
login.user.keytab.username=hdfs-mycluster@ESZ.COM
# login user from keytab path
login.user.keytab.path=/opt/hdfs.headless.keytab
# kerberos expire time, the unit is hour
kerberos.expire.time=2

# resourcemanager port, the default value is 8088 if not specified
resource.manager.httpaddress.port=8088
# if resourcemanager HA is enabled, please set the HA IPs; if resourcemanager is single, keep this value empty
yarn.resourcemanager.ha.rm.ids=192.168.xx.xx,192.168.xx.xx
# if resourcemanager HA is enabled or not use resourcemanager, please keep the default value; If resourcemanager is single, you only need to replace ds1 to actual resourcemanager hostname
yarn.application.status.address=http://cm:%s/ws/v1/cluster/apps/%s
# job history status url when application number threshold is reached(default 10000, maybe it was set to 1000)
yarn.job.history.status.address=http://cm:19888/ws/v1/history/mapreduce/jobs/%s

# datasource encryption enable
datasource.encryption.enable=false
# datasource encryption salt
datasource.encryption.salt=!@#$%^&*

# data quality option
data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
#data-quality.error.output.path=/tmp/data-quality-error-data

# Whether hive SQL is executed in the same session
support.hive.oneSession=false
# use sudo or not, if set true, executing user is tenant user and deploy user needs sudo permissions; if set false, executing user is the deploy user and doesn't need sudo permissions
sudo.enable=true
setTaskDirToTenant.enable=false

# network interface preferred like eth0, default: empty
#dolphin.scheduler.network.interface.preferred=
# network IP gets priority, default: inner outer
#dolphin.scheduler.network.priority.strategy=default

# system env path
#dolphinscheduler.env.path=dolphinscheduler_env.sh
# development state
development.state=false
# rpc port
alert.rpc.port=50052

# set path of conda.sh
conda.path=/opt/anaconda3/etc/profile.d/conda.sh
# Task resource limit state
task.resource.limit.state=false

# mlflow task plugin preset repository
ml.mlflow.preset_repository=https://github.com/apache/dolphinscheduler-mlflow
# mlflow task plugin preset repository version
ml.mlflow.preset_repository_version="main"

Once the api-server copy has been modified, it can be copied straight into worker-server, since the contents of the two files are identical.

Using the command:

cp ./api-server/conf/common.properties ./worker-server/conf/
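As the comments in common.properties themselves note, the upload base path must already exist on HDFS with read/write permission. One way to pre-create it, as a sketch (run on a node with an HDFS gateway, and adjust the ownership to the tenant users you plan to use):

sudo -u hdfs hdfs dfs -mkdir -p /dolphinscheduler
sudo -u hdfs hdfs dfs -chown -R dolphin:dolphin /dolphinscheduler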

4.4 Configure MySQL

1. Initialize the DolphinScheduler database.

Since MySQL was chosen as the database, the MySQL (8.x) JDBC driver must be copied into the ./api-server/libs, ./alert-server/libs, ./master-server/libs, ./worker-server/libs, ./tools/libs, and ./standalone-server/libs directories before initialization:

cd /home/dolphin/apache-dolphinscheduler-3.1.9-bin;
wget -O /home/dolphin/apache-dolphinscheduler-3.1.9-bin/mysql-connector-j-8.0.33.tar.gz \
https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-j-8.0.33.tar.gz \
&& tar -zxvf mysql-connector-j-8.0.33.tar.gz
cp  ./mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar ./api-server/libs/
cp  ./mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar ./alert-server/libs/
cp  ./mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar ./master-server/libs/ 
cp  ./mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar ./standalone-server/libs/ 
cp  ./mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar ./worker-server/libs/
cp  ./mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar ./tools/libs/

2. Log in to MySQL as the root user and run the following SQL statements to create the DS database.

Because the MySQL client installed with CDH is version 5.7, the MySQL client needs to be upgraded first:

yum install -y https://dev.mysql.com/get/mysql80-community-release-el7-11.noarch.rpm \
&& yum install -y mysql-community-client
mysql -h mysql -uroot -p123456Aa.
CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
CREATE USER IF NOT EXISTS 'dolphins'@'%' IDENTIFIED BY '123456Aa.';
GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'dolphins'@'%';
ALTER USER 'root'@'%' IDENTIFIED WITH mysql_native_password BY '123456Aa.';
flush privileges;

Finally, run:

bash tools/bin/upgrade-schema.sh

If the script completes without errors, the DolphinScheduler schema has been initialized.


4.5 Replace the Hive JARs Used by DolphinScheduler

Because the Hive version in CDH 6.3.2 is 2.1.1 while the Hive client JARs bundled with DolphinScheduler 3.1.9 are newer, DolphinScheduler cannot connect to this Hive, so the JARs need to be downgraded.

1. Upload the JARs into the Docker container first:

docker cp hive-common-2.1.1.jar cm.hadoop:/opt
docker cp hive-jdbc-2.1.1.jar cm.hadoop:/opt
docker cp hive-metastore-2.1.1.jar cm.hadoop:/opt
docker cp hive-orc-2.1.1.jar cm.hadoop:/opt
docker cp hive-serde-2.1.1.jar cm.hadoop:/opt
docker cp hive-service-2.1.1.jar cm.hadoop:/opt
docker cp hive-service-rpc-2.1.1.jar cm.hadoop:/opt
docker cp hive-storage-api-2.1.1.jar cm.hadoop:/opt

2. Delete the existing Hive JARs:

cd /home/dolphin/apache-dolphinscheduler-3.1.9-bin/
rm -rf ./api-server/libs/hive-*
rm -rf ./alert-server/libs/hive-*
rm -rf ./worker-server/libs/hive-*
rm -rf ./tools/libs/hive-*

3. Copy the JARs that were uploaded into the container into the corresponding services:

cp /opt/hive-* ./api-server/libs/
cp /opt/hive-* ./alert-server/libs/
cp /opt/hive-* ./worker-server/libs/
cp /opt/hive-* ./tools/libs/

Run the installation script:

bash bin/install.sh

The services are started automatically once the installation completes.

Open the web UI in a browser:
http://[IP]:12345/dolphinscheduler
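To verify that everything came up, a quick check is to look at the Java processes on each node (a sketch; the class names below are the ones the DolphinScheduler 3.1.x services run under). For a fresh installation the default web login should be admin / dolphinscheduler123; change it after the first login.

# Run on each node; expect the processes for the roles deployed there.
jps | grep -E 'MasterServer|WorkerServer|ApiApplicationServer|AlertServer'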