- Visit the Apache Spark Downloads page.
- Select the following options:
  - Choose a Spark release: 2.2.x or greater (I'll be using 2.2.1)
  - Choose a package type: Pre-built for Apache Hadoop 2.7 and later
  - Download Spark: spark-2.2.1-bin-hadoop2.7.tgz
- Download the compressed tar file to your local machine.
- After downloading the compressed file, extract it to the desired location:
  $ tar -xvzf spark-2.2.1-bin-hadoop2.7.tgz -C /home/prakshi/spark-2.2.1/
  (The directory passed to -C must already exist; create it first with mkdir -p if needed.)
- Setting up the environment for Spark:
  To set up the environment variables, add the following lines to your ~/.bashrc:
  export SPARK_HOME=/home/prakshi/spark-2.2.1/
  export PATH=$SPARK_HOME/bin:$PATH
  Make sure you change the path in SPARK_HOME to wherever your Spark files are located. Reload your ~/.bashrc file using:
  $ source ~/.bashrc
- That's all! Spark has been set up. Try running the pyspark command to use Spark from Python. There are two ways to use PySpark from a Jupyter Notebook.
-
Configure PySpark driver Update PySpark driver environment variables: add these lines to your ~/.bashrc (or ~/.zshrc) file.
export PYSPARK_DRIVER_PYTHON=jupyter export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Restart your terminal and launch PySpark again:
$ pyspark
Now, this command should start a Jupyter Notebook in your web browser.
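If you prefer JupyterLab over the classic notebook interface, the same driver trick should work with a different option value; this variant assumes JupyterLab is installed, and the 'notebook' setting above is what this setup actually uses:

```shell
# Variant (assumes JupyterLab is installed): launch JupyterLab
# instead of the classic notebook when pyspark starts the driver.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
```
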
- Using the findspark module: the findspark package is not specific to Jupyter Notebook; you can use this trick in your favorite IDE too.
  To install findspark:
  $ pip install findspark
  Whether you are in a Jupyter notebook or a plain Python script, all you need to do to use Spark is add the following lines to your code:
  import findspark
  findspark.init()
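What findspark.init() does, in essence, is locate your Spark installation (via SPARK_HOME or an explicit path) and make Spark's bundled Python packages importable. A simplified sketch of that idea, with an illustrative function name that is not findspark's actual source:

```python
import os
import sys

def init_spark_path(spark_home=None):
    """Simplified sketch of the idea behind findspark.init():
    make Spark's bundled Python API importable via sys.path."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise ValueError("SPARK_HOME is not set and no path was given")
    # Spark ships its Python API under $SPARK_HOME/python
    sys.path.insert(0, os.path.join(spark_home, "python"))
    return spark_home

# Example: point it at the install location used above
home = init_spark_path("/home/prakshi/spark-2.2.1/")
print(home)
```

The real findspark also adds Spark's bundled py4j library to the path; once that is done, `import pyspark` resolves against the downloaded Spark distribution without a separate pip install of pyspark.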