Deploy Spark Standalone Cluster

Standalone is the master-slave cluster deployment mode provided by Spark itself. This article describes a conventional deployment with one master and multiple workers. In this mode, the master service relies on Kato platform monitoring to ensure its availability, and can be rescheduled and restarted. The worker service can be scaled out to multiple instances as needed.

The finished deployment looks like this:

Kato Deployment Renderings

Spark master UI diagram

Deployment Steps

Before starting, complete the installation and setup of the Kato platform; refer to Kato Installation and Deployment. This document is intended for readers who already know the basic operations of Kato. If you are new to the Kato platform, please refer to the Kato Quick Start Guide first.

Deploy Single-instance Master Service

  1. Deploy spark-master by using Kato to create a component from a Docker image:

  2. After confirming that the creation check succeeds, select Advanced Settings and make the following special settings.

    • Add an environment variable in the environment variable module:

      SPARK_DAEMON_JAVA_OPTS=-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/data

      This sets spark-master to “Recovery with Local File System” mode. After the master restarts, it restores its state from the persisted files, which maintains the availability of the master service.

    • Add shared storage mounted at /data in the storage settings, so that the master’s data is persisted and can be restored after a restart.

    • Open the external service for port 8080 in port management. After the component starts successfully, you can access the master UI.

    • In the deployment properties, set the component type to Stateful Single Instance.

      Once deployed as a stateful component, the master gets a stable internal access domain name for the worker components to connect to. Stateful service control also ensures that the master never runs more than one instance at a time.

  3. After the settings are complete, confirm the creation to start the master service.

Once the component has started successfully, click Visit to open the master UI. As shown in the figure above, the UI shows the access address of the master service. Note that the address displayed in the UI is spark://gr7b570e-0:7077, but the one we need to use is spark://gr7b570e:7077. Copy and record this address.

Note: check your own UI for the actual value of the address; the one shown here is only an example.
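
The conversion from the address shown in the UI to the address the workers should use can be sketched as follows. This is a minimal illustration, assuming that the trailing -0 is the instance ordinal appended to the component's service name, as the example above suggests. The function name service_master_url is hypothetical and not part of Kato or Spark.

```python
import re
from urllib.parse import urlsplit


def service_master_url(ui_url: str) -> str:
    """Turn the per-instance address shown in the master UI into the service address."""
    parts = urlsplit(ui_url)
    if parts.scheme != "spark" or not parts.hostname or parts.port is None:
        raise ValueError(f"not a spark://host:port address: {ui_url!r}")
    # Assumption: a trailing "-<digits>" is the instance ordinal,
    # not part of the service name itself.
    host = re.sub(r"-\d+$", "", parts.hostname)
    return f"spark://{host}:{parts.port}"


print(service_master_url("spark://gr7b570e-0:7077"))  # spark://gr7b570e:7077
```

If your service name itself ends in a hyphen followed by digits, this sketch would strip too much, so compare the result against the component's internal domain name before using it.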

Deploy Multi-instance Worker Service

  1. Deploy spark-worker by creating a component from a docker run command. This creation method lets you set some necessary attributes directly:

    docker run -it -e SPARK_MASTER=spark://gr7b570e:7077 -e SPARK_WORKER_MEMORY=1g bde2020/spark-worker:3.0.1-hadoop3.2

    Two environment variables are specified in the command above.

    • SPARK_MASTER specifies the address of the master, which was obtained from the master component created in the previous step.
    • SPARK_WORKER_MEMORY sets the amount of memory for a single worker instance. Set it to match the memory allocated to each instance; for example, if each instance is allocated 1GB, set SPARK_WORKER_MEMORY=1g. If this variable is not set, the service reads the amount of memory from the operating system. Because the worker runs in a container, the value it reads is the entire memory of the host, which is much larger than the memory actually allocated to the worker instance.
  2. Also enter the advanced settings and set the component deployment mode to Stateful Multi-instance.
  3. Confirm the creation of the component. After it starts successfully, you can set the number of running worker instances on the component’s scaling page.
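
To keep SPARK_WORKER_MEMORY in step with the memory allocated to each instance, the docker run command can be generated rather than typed by hand. The following is a small sketch under stated assumptions: the helper names spark_memory and worker_run_command are hypothetical, and the per-instance allocation is assumed to be known in MB.

```python
import shlex


def spark_memory(memory_mb: int) -> str:
    """Format a memory size in MB as a Spark memory string such as '1g' or '512m'."""
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    if memory_mb % 1024 == 0:
        return f"{memory_mb // 1024}g"
    return f"{memory_mb}m"


def worker_run_command(
    master_url: str,
    memory_mb: int,
    image: str = "bde2020/spark-worker:3.0.1-hadoop3.2",
) -> str:
    """Build the docker run command used to create the worker component."""
    return shlex.join([
        "docker", "run", "-it",
        "-e", f"SPARK_MASTER={master_url}",
        "-e", f"SPARK_WORKER_MEMORY={spark_memory(memory_mb)}",
        image,
    ])


# Replace the address with the one recorded from your own master UI.
print(worker_run_command("spark://gr7b570e:7077", 1024))
```

With a 1024 MB allocation this reproduces the command shown in step 1. An allocation such as 1536 MB would yield SPARK_WORKER_MEMORY=1536m instead.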

At this point, our Spark cluster has been deployed.
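
One way to check that the workers have registered is to read the master's state programmatically instead of looking at the UI. The sketch below assumes that the master web UI on port 8080 also serves its state as JSON under /json/, with a status field and a workers list whose entries carry state, cores and memory (in MB). Verify these field names against your Spark version. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_master_state(ui_base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the master state; assumes the UI exposes it as JSON under /json/."""
    with urlopen(f"{ui_base_url.rstrip('/')}/json/", timeout=timeout) as resp:
        return json.load(resp)


def summarize_master_state(state: dict) -> dict:
    """Count the workers that are alive and add up their cores and memory."""
    workers = state.get("workers", [])
    alive = [w for w in workers if w.get("state") == "ALIVE"]
    return {
        "status": state.get("status"),
        "alive_workers": len(alive),
        "total_cores": sum(w.get("cores", 0) for w in alive),
        "total_memory_mb": sum(w.get("memory", 0) for w in alive),
    }


# Illustrative payload only, not output captured from a real cluster.
sample_state = {
    "status": "ALIVE",
    "workers": [
        {"state": "ALIVE", "cores": 2, "memory": 1024},
        {"state": "DEAD", "cores": 2, "memory": 1024},
    ],
}
print(summarize_master_state(sample_state))
```

Against a running cluster you would call summarize_master_state(fetch_master_state(...)) with the external address of port 8080. If SPARK_WORKER_MEMORY is set as described above, the memory reported for each worker should match the per-instance allocation rather than the host's total memory.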

Spark Data Reading

The principle of processing data close to where it is stored is gradually being broken

In the past, we preferred to deploy data processing services (Hadoop, YARN, etc.) as close to the data as possible. The main reason is that the Hadoop computing model is IO-intensive. If data and computation are separated, the IO moves onto the network, which raises network traffic and bandwidth requirements.

But the Spark mechanism is different. Spark caches data in memory as much as possible, which means that it mainly consumes memory and CPU. The machines that store the data may not have enough memory and CPU to spare for this. Therefore, separating data and computation is the better choice.

More options after data and calculation are separated

The separation of data and computing means that computing services are deployed separately, and storage services provide data to them over the network. Going over the network means that there are multiple protocols to choose from. In addition to traditional HDFS, object storage is now commonly used, such as the various S3-compatible services, as are distributed file systems. You can make a reasonable choice based on the data types and actual needs. The computing service (Spark worker) can flexibly allocate computing resources in the distributed cluster according to the needs of the task.

The deployment of a Spark cluster in Kato described in this article is such a use case.
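
As an illustration of reading from S3-compatible object storage, the sketch below collects the Hadoop S3A connector settings that a Spark job would typically need and turns them into spark-submit --conf arguments. It assumes the S3A connector (the hadoop-aws module) is available on the Spark classpath, which this article does not cover. The endpoint and credentials are placeholders, and the helper names are hypothetical.

```python
def s3a_conf(endpoint: str, access_key: str, secret_key: str,
             path_style: bool = True) -> dict:
    """Spark properties that point the Hadoop S3A connector at an S3-compatible service."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style access is commonly needed by self-hosted S3-compatible services.
        "spark.hadoop.fs.s3a.path.style.access": str(path_style).lower(),
    }


def to_submit_args(conf: dict) -> list:
    """Render properties as spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder endpoint and credentials; substitute your own values.
conf = s3a_conf("http://objectstore.example.internal:9000", "ACCESS_KEY", "SECRET_KEY")
print(to_submit_args(conf))
```

With these properties in place, a job can address data with s3a:// paths while the workers run wherever the cluster has spare memory and CPU.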

Master Node Active and Standby High Availability

Based on ZooKeeper, Spark can switch the master service between active and standby instances. The configuration is also relatively simple; refer to the official documentation.
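
For orientation, the ZooKeeper mode is configured through the same SPARK_DAEMON_JAVA_OPTS variable used earlier for file system recovery, with different property values. The sketch below builds that value and the multi-master address that workers and applications would use. The ZooKeeper and master host names are placeholders and the helper names are hypothetical; check the property names against the official documentation for your Spark version.

```python
def zookeeper_recovery_opts(zk_hosts: list, zk_dir: str = "/spark") -> str:
    """Build a SPARK_DAEMON_JAVA_OPTS value for ZooKeeper-based master recovery."""
    if not zk_hosts:
        raise ValueError("at least one ZooKeeper host:port is required")
    return " ".join([
        "-Dspark.deploy.recoveryMode=ZOOKEEPER",
        f"-Dspark.deploy.zookeeper.url={','.join(zk_hosts)}",
        f"-Dspark.deploy.zookeeper.dir={zk_dir}",
    ])


def ha_master_url(masters: list) -> str:
    """Join several master host:port pairs into one spark:// address."""
    if not masters:
        raise ValueError("at least one master host:port is required")
    return "spark://" + ",".join(masters)


# Placeholder host names for illustration only.
print(zookeeper_recovery_opts(["zk1:2181", "zk2:2181", "zk3:2181"]))
print(ha_master_url(["master1:7077", "master2:7077"]))
```

Each master instance would receive the same options, and SPARK_MASTER on the workers would list every master so that they can find whichever one is currently active.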