Comments (4)
OK, the problem seems to be that Spark 2.0 ships a newer version of Py4J than the one referenced in jupyter/kernels/pyspark/kernel.json
I was able to make it work just by applying this change:
< "PYTHONPATH": "/usr/lib/spark/python/:/usr/lib/spark/python/lib/py4j-0.9-src.zip",
---
> "PYTHONPATH": "/usr/lib/spark/python/:/usr/lib/spark/python/lib/py4j-0.10.1-src.zip",
I don't know whether it is the only broken reference, as I didn't do extensive testing.
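Rather than hard-coding the Py4J version in kernel.json's PYTHONPATH, the entry can be derived from whatever py4j-*-src.zip Spark actually ships. A minimal sketch, assuming the standard /usr/lib/spark layout (the helper name is hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the PYTHONPATH value for kernel.json from the
# Py4J zip Spark ships, so a version bump (0.9 -> 0.10.1) can't break it.
py4j_pythonpath() {
  local spark_home="$1"
  # Pick the py4j-*-src.zip actually present under Spark's lib directory.
  local zip
  zip=$(ls "$spark_home"/python/lib/py4j-*-src.zip 2>/dev/null | head -n 1)
  [ -n "$zip" ] || return 1
  printf '%s/python/:%s\n' "$spark_home" "$zip"
}
```

For example, `py4j_pythonpath /usr/lib/spark` would print the full PYTHONPATH value to substitute into kernel.json.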
from initialization-actions.
Hi @gilmar, where did you make the change (master? workers?) and would you be able to walk me through the steps?
import pyspark
sc.version
If I run the code above in an SSH terminal on the master node, it returns u'2.0.0'.
When I run the same code in a new PySpark notebook with a freshly started kernel, I get the error you mentioned; if I run the cell again, I get this:
ImportError Traceback (most recent call last)
<ipython-input-2-cc2b46586f8c> in <module>()
----> 1 import pyspark
2 sc.version
/usr/lib/spark/python/pyspark/__init__.py in <module>()
42
43 from pyspark.conf import SparkConf
---> 44 from pyspark.context import SparkContext
45 from pyspark.rdd import RDD
46 from pyspark.files import SparkFiles
/usr/lib/spark/python/pyspark/context.py in <module>()
26 from tempfile import NamedTemporaryFile
27
---> 28 from pyspark import accumulators
29 from pyspark.accumulators import Accumulator
30 from pyspark.broadcast import Broadcast
ImportError: cannot import name accumulators
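The ImportError above is consistent with a stale Py4J entry: pyspark's internal imports fail because the py4j-0.9 zip on the kernel's PYTHONPATH no longer exists after the Spark 2.0 upgrade. A quick hypothetical diagnostic (the helper name is an assumption) that lists PYTHONPATH entries missing on disk:

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic: print PYTHONPATH entries that don't exist on disk.
# After upgrading to Spark 2.0, the stale py4j-0.9-src.zip path shows up here.
missing_pythonpath_entries() {
  local IFS=':'
  local p
  for p in $1; do
    # Report non-empty entries that no longer resolve to a file or directory.
    [ -n "$p" ] && [ ! -e "$p" ] && printf '%s\n' "$p"
  done
  return 0
}
```

Run it as `missing_pythonpath_entries "$PYTHONPATH"` inside the kernel's environment; any py4j-0.9 path it prints confirms the broken reference.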
Hi @mobcdi ,
I was testing locally, but I have just created a pull request.
Meanwhile, you can use the script below as your initialization script. The only difference from the original is that it points to my fork instead of this repo.
Just upload it to your GCS bucket and point to it when creating your cluster, like this:
--initialization-actions gs://$YOUR_BUCKET/jupyter.sh
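For context, a hypothetical end-to-end invocation (the bucket and cluster names are placeholders):

```shell
# Upload the patched init action, then reference it at cluster creation.
gsutil cp jupyter.sh gs://my-bucket/jupyter.sh
gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/jupyter.sh
```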
#!/usr/bin/env bash
set -e
ROLE=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-role)
INIT_ACTIONS_REPO=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/INIT_ACTIONS_REPO || true)
INIT_ACTIONS_REPO="${INIT_ACTIONS_REPO:-https://github.com/gilmar/dataproc-initialization-actions.git}"
INIT_ACTIONS_BRANCH=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/INIT_ACTIONS_BRANCH || true)
INIT_ACTIONS_BRANCH="${INIT_ACTIONS_BRANCH:-master}"
DATAPROC_BUCKET=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-bucket)
echo "Cloning fresh dataproc-initialization-actions from repo $INIT_ACTIONS_REPO and branch $INIT_ACTIONS_BRANCH..."
git clone -b "$INIT_ACTIONS_BRANCH" --single-branch "$INIT_ACTIONS_REPO"
# Ensure we have conda installed.
./dataproc-initialization-actions/conda/bootstrap-conda.sh
#./dataproc-initialization-actions/conda/install-conda-env.sh
source /etc/profile.d/conda_config.sh
if [[ "${ROLE}" == 'Master' ]]; then
  conda install -y jupyter
  if gsutil -q stat "gs://$DATAPROC_BUCKET/notebooks/**"; then
    echo "Pulling notebooks directory to cluster master node..."
    gsutil -m cp -r "gs://$DATAPROC_BUCKET/notebooks" /root/
  fi
  ./dataproc-initialization-actions/jupyter/internal/setup-jupyter-kernel.sh
  ./dataproc-initialization-actions/jupyter/internal/launch-jupyter-kernel.sh
fi
echo "Completed installing Jupyter!"
# Install Jupyter extensions (if desired)
# TODO: document this in readme
if [[ ! -v INSTALL_JUPYTER_EXT ]]
then
  INSTALL_JUPYTER_EXT=false
fi
if [[ "$INSTALL_JUPYTER_EXT" = true ]]
then
  echo "Installing Jupyter Notebook extensions..."
  ./dataproc-initialization-actions/jupyter/internal/bootstrap-jupyter-ext.sh
  echo "Jupyter Notebook extensions installed!"
fi
I'm getting the "ImportError: cannot import name accumulators" error as well. Has anyone solved this?