
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or by using our provided launch scripts. It is also possible to run these daemons on a single machine for testing.

Installing Spark Standalone to a Cluster

The easiest way to deploy Spark is by running the ./make-distribution.sh script to create a binary distribution. This distribution can be deployed to any machine with the Java runtime installed; there is no need to install Scala.

The recommended procedure is to deploy and start the master on one node first, get the master spark URL, then modify conf/spark-env.sh in the dist/ directory before deploying to all the other nodes.
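
A compressed sketch of that workflow might look like the following; the worker hostnames (node1, node2) and the rsync install path are assumptions, not part of the official procedure:

./make-distribution.sh        # build the binary distribution into dist/
cd dist
./bin/start-master.sh         # start the master on this node; note the spark://HOST:PORT URL it prints
vi conf/spark-env.sh          # e.g. set SPARK_MASTER_IP so the workers can reach the master
rsync -az . node1:/opt/spark  # hypothetical worker host and install path
rsync -az . node2:/opt/spark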

Starting a Cluster Manually

You can start a standalone master server by executing:

./bin/start-master.sh

Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.
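
For instance, a driver program written in Scala could connect to such a master roughly as follows; the spark://localhost:7077 URL and the application name are placeholder assumptions:

import org.apache.spark.SparkContext

// Substitute the spark://HOST:PORT URL printed by your master.
val sc = new SparkContext("spark://localhost:7077", "MyApp")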

Similarly, you can start one or more workers and connect them to the master via:

./spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).

Finally, the following configuration options can be passed to the master and worker:

Argument Meaning
-i IP, --ip IP IP address or DNS name to listen on
-p PORT, --port PORT Port for service to listen on (default: 7077 for master, random for worker)
--webui-port PORT Port for web UI (default: 8080 for master, 8081 for worker)
-c CORES, --cores CORES Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
-m MEM, --memory MEM Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker
-d DIR, --work-dir DIR Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker
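
For example, a worker that should expose only part of its machine to Spark might be started as follows; the 4-core and 8 GB values are purely illustrative:

./spark-class org.apache.spark.deploy.worker.Worker --cores 4 --memory 8G spark://IP:PORT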


Cluster Launch Scripts

To launch a Spark standalone cluster with the launch scripts, you need to create a file called conf/slaves in your Spark directory, which should contain the hostnames of all the machines where you would like to start Spark workers, one per line. The master machine must be able to access each of the slave machines via password-less ssh (using a private key). For testing, you can just put localhost in this file.
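
A conf/slaves file for a small cluster might simply look like this; the hostnames are placeholders:

worker1.example.com
worker2.example.com
worker3.example.com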

Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in SPARK_HOME/bin:

  • bin/start-master.sh - Starts a master instance on the machine the script is executed on.
  • bin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
  • bin/start-all.sh - Starts both a master and a number of slaves as described above.
  • bin/stop-master.sh - Stops the master that was started via the bin/start-master.sh script.
  • bin/stop-slaves.sh - Stops the slave instances that were started via bin/start-slaves.sh.
  • bin/stop-all.sh - Stops both the master and the slaves as described above.


Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.


You can optionally configure the cluster further by setting environment variables in conf/spark-env.sh. Create this file by starting with the conf/spark-env.sh.template, and copy it to all your worker machines for the settings to take effect. The following settings are available:

Environment Variable Meaning
SPARK_MASTER_IP Bind the master to a specific IP address, for example a public one.
SPARK_MASTER_PORT Start the master on a different port (default: 7077).
SPARK_MASTER_WEBUI_PORT Port for the master web UI (default: 8080).
SPARK_WORKER_PORT Start the Spark worker on a specific port (default: random).
SPARK_WORKER_DIR Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).
SPARK_WORKER_CORES Total number of cores to allow Spark applications to use on the machine (default: all available cores).
SPARK_WORKER_MEMORY Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property.
SPARK_WORKER_WEBUI_PORT Port for the worker web UI (default: 8081).
SPARK_WORKER_INSTANCES Number of worker instances to run on each machine (default: 1). You can make this more than 1 if you have very large machines and would like multiple Spark worker processes. If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.
SPARK_DAEMON_MEMORY Memory to allocate to the Spark master and worker daemons themselves (default: 512m).
SPARK_DAEMON_JAVA_OPTS JVM options for the Spark master and worker daemons themselves (default: none).
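
As an example, a conf/spark-env.sh that pins the master address and caps what each worker may use could contain lines like these; the address and resource values are placeholders:

export SPARK_MASTER_IP=192.168.1.10   # placeholder address of the master node
export SPARK_WORKER_CORES=4           # allow Spark at most 4 cores per worker
export SPARK_WORKER_MEMORY=8g         # and at most 8 GB of memory per worker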

Note: The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.



Connecting an Application to the Cluster

To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor.

To run an interactive Spark shell against the cluster, run the following command:

MASTER=spark://IP:PORT ./spark-shell

Note that if you are running spark-shell from one of the Spark cluster machines, the spark-shell script will automatically set MASTER from the SPARK_MASTER_IP and SPARK_MASTER_PORT variables in conf/spark-env.sh.

You can also pass an option -c <numCores> to control the number of cores that spark-shell uses on the cluster.
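
For example, the following invocation (with an illustrative core count) limits the shell to four cores:

MASTER=spark://IP:PORT ./spark-shell -c 4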



Resource Scheduling

The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will acquire. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time. You can cap the number of cores using System.setProperty("spark.cores.max", "10") (for example). This value must be set before initializing your SparkContext.
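
A minimal Scala sketch of this pattern, with a placeholder master URL, application name, and core cap:

import org.apache.spark.SparkContext

System.setProperty("spark.cores.max", "10")   // must run before the SparkContext is created
val sc = new SparkContext("spark://localhost:7077", "CappedApp")   // placeholder URL and app name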



Monitoring and Logging

Spark’s standalone mode offers a web-based user interface to monitor the cluster. The master and each worker has its own web UI that shows cluster and job statistics. By default you can access the web UI for the master at port 8080. The port can be changed either in the configuration file or via command-line options.

In addition, detailed log output for each job is also written to the work directory of each slave node (SPARK_HOME/work by default). You will see two files for each job, stdout and stderr, with all output it wrote to its console.



Running Alongside Hadoop

You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use a hdfs:// URL (typically hdfs://<namenode>:9000/path, but you can find the right URL on your Hadoop Namenode’s web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on).
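
Reading such a file from a Spark program then looks roughly like this in Scala; the namenode address, path, and master URL are placeholders:

import org.apache.spark.SparkContext

val sc = new SparkContext("spark://localhost:7077", "HdfsExample")   // placeholder master URL and app name
val lines = sc.textFile("hdfs://namenode:9000/path/to/data.txt")     // placeholder HDFS path
println(lines.count())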

