Apache Livy
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library. Apache Livy also simplifies the interaction between Spark and application servers, thus enabling the use of Spark for interactive web/mobile applications.
Livy's basic architecture
Architecturally, Livy is a typical REST service: it accepts and parses users' REST requests and translates them into the corresponding operations, and it also manages all of the Spark clusters launched by users.
Users can start a new Spark cluster through Livy with a REST request. Livy calls each launched Spark cluster a session. A session consists of a complete Spark cluster, and the Spark cluster and the Livy server communicate over an RPC protocol. Depending on how the interaction is handled, Livy divides sessions into two types:
Interactive session: the equivalent of interactive processing in Spark. Once started, an interactive session can receive code fragments submitted by the user and compile and execute them on the remote Spark cluster.
Batch session: the user can start a Spark application in batch mode through Livy. This is called a batch session in Livy and is the equivalent of batch processing in Spark.
The core functionality Livy provides is thus the same as native Spark: the two session types correspond to Spark's two modes of processing interaction. Let's take a closer look at each of them.
Using interactive sessions is similar to using Spark's own spark-shell, pyspark, or sparkR: the user's input is submitted to a REPL, compiled into Spark jobs, and executed. The main difference is that spark-shell starts the REPL on the current node to receive the user's input, whereas a Livy interactive session starts the REPL on the remote Spark cluster, so all code and data have to be transmitted over the network. Let's take a look at how to use an interactive session.
Create an interactive session
POST /sessions
curl -X POST -d '{"kind": "spark"}' -H "Content-Type: application/json" <livy-host>:<port>/sessions
The prerequisite for using an interactive session is that a session must be created first. When we submit a request to create an interactive session, we need to specify the session type ("kind"), such as "spark"; Livy starts the corresponding REPL according to the type we specify. Livy currently supports three interactive session types, spark, pyspark, and sparkr, to cover the needs of different languages.
Once the session is created, Livy returns a JSON-formatted data structure that contains all the information about the session:
{
  "appId": "application_1493903362142_0005",
  ...
  "id": 1,
  "kind": "spark",
  "log": [],
  "owner": null,
  "proxyUser": null,
  "state": "idle"
}
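Once the session state is "idle", code fragments can be submitted to it as statements. A minimal sketch, reusing the host, port, and session id 1 from the response above (the Scala snippet is only an illustration):

POST /sessions/{sessionId}/statements

curl -X POST -d '{"code": "sc.parallelize(1 to 100).count()"}' -H "Content-Type: application/json" <livy-host>:<port>/sessions/1/statements

Livy compiles and runs the statement asynchronously and answers with a statement id and state; the result can then be polled:

curl <livy-host>:<port>/sessions/1/statements/0

When the statement's state becomes "available", its "output" field contains the result. A session that is no longer needed can be closed with DELETE /sessions/1.

Livy's key features include: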
- Long-running Spark contexts that can be used for multiple Spark jobs, by multiple clients
- Share cached RDDs or Datasets across multiple jobs and clients
- Multiple Spark contexts can be managed simultaneously, and the Spark contexts run on the cluster (YARN/Mesos) instead of the Livy server, for good fault tolerance and concurrency
- Jobs can be submitted as precompiled jars, snippets of code, or via the Java/Scala client API
- Ensure security via secure authenticated communication
There are three ways to interact with Livy:
- Using the programmatic API
- Running interactive statements through the REST API
- Submitting batch applications with the REST API (see the example below)
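Submitting a batch application means POSTing the application jar (or Python file) and its entry class to the /batches endpoint. A minimal sketch; the jar path, class name, and argument below are placeholders:

POST /batches

curl -X POST -d '{"file": "hdfs:///path/to/your-spark-app.jar", "className": "com.example.YourSparkApp", "args": ["arg1"]}' -H "Content-Type: application/json" <livy-host>:<port>/batches

Livy answers with a batch id, after which the state and driver logs can be followed with GET /batches/{batchId} and GET /batches/{batchId}/log.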
Livy uses a few configuration files under the configuration directory, which by default is the conf directory under the Livy installation. An alternative configuration directory can be provided by setting the LIVY_CONF_DIR environment variable when starting Livy.
- livy.conf: contains the server configuration.
- spark-blacklist.conf: lists Spark configuration options that users are not allowed to override. These options will be restricted to either their default values or the values set in the Spark configuration used by Livy.
- log4j.properties: configuration for Livy logging.
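As a rough illustration (the values below are only examples; the shipped livy.conf.template lists the available options), a minimal livy.conf might set the bind port, the Spark master, and the session timeout:

# Port the Livy server listens on (8998 is the default)
livy.server.port = 8998
# Run Livy sessions on a YARN cluster rather than locally
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
# Idle sessions are cleaned up after this timeout
livy.server.session.timeout = 1h

To use the programmatic Java/Scala client API mentioned above, add the Livy HTTP client to your project, for example via Maven: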
<dependency>
<groupId>org.apache.livy</groupId>
<artifactId>livy-client-http</artifactId>
<version>0.5.0-incubating</version>
</dependency>