Comments (15)
Can you please provide the command that you use to submit Spark TeraSort? Do you use YARN or standalone cluster deployment mode? Can you also check the logs for that BlockManagerId?
Thanks,
Peter
from sparkrdma.
Hi,
I used the following command to submit the Spark TeraSort job:
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort \
--master spark://rdma21:7077 /root/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs:///terasort_in hdfs:///terasort_out
I used standalone mode.
I do not know where to find the BlockManagerId. I searched the logs folder but did not find any BlockManagerId there.
OK, can you please check for errors in the Spark log directory: grep -ri error $SPARK_HOME/logs/
What dataset size do you run on?
Hi, the attached zip file contains the error messages.
The dataset size I used is 1g.
spark-root-org.apache.spark.deploy.worker.Worker-1-rdma21.zip
Sorry, the logs for executors are in the work directory.
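As a rough sketch of how to look through those executor logs: in standalone mode a worker normally writes them under $SPARK_HOME/work/<app-id>/<executor-id>/stderr. The helper below lists every executor stderr file that mentions an error. The function name scan_executor_logs is made up for this example and is not part of Spark; the directory layout is an assumption that holds only if SPARK_WORKER_DIR was not changed.

```shell
# Hypothetical helper (not part of Spark): print every executor stderr file
# under a standalone worker's work directory that contains "error".
# Assumed layout: <work_dir>/<app-id>/<executor-id>/stderr
scan_executor_logs() {
    local work_dir="$1"
    local log
    for log in "$work_dir"/*/*/stderr; do
        # If the glob matched nothing, it stays unexpanded; skip it.
        [ -f "$log" ] || continue
        # -q: no output, only exit status; -i: case-insensitive match.
        if grep -qi "error" "$log"; then
            echo "$log"
        fi
    done
}
```

It would be run on each worker machine as: scan_executor_logs "$SPARK_HOME/work"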
Hi,
Here are the error logs from the work directory.
stderr.zip
In the work directory there should be a folder for each application, and inside it a separate folder for each executor. You need to collect the executor logs from the machines or use NFS. I tried to reproduce your case, and it works for me:
- Teragen 1g of data:
spark/bin/spark-submit -v --num-executors 10 --executor-cores 20 --executor-memory 24G --master yarn --class com.github.ehiggs.spark.terasort.TeraGen /hpc/scrap/users/peterr/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 1g /terasort-input-1g
- Run terasort:
$ cat /hpc/scrap/users/swat/jenkins/spark/spark2.conf
spark.driver.extraJavaOptions -Djava.library.path=/hpc/scrap/users/swat/jenkins/disni/
spark.executor.extraClassPath /hpc/scrap/users/swat/jenkins//spark_rdma_artifacts/spark-rdma-2.0-for-spark-2.1.0-jar-with-dependencies.jar
spark.driver.extraClassPath /hpc/scrap/users/swat/jenkins//spark_rdma_artifacts/spark-rdma-2.0-for-spark-2.1.0-jar-with-dependencies.jar
spark.executor.extraJavaOptions -Djava.library.path=/hpc/scrap/users/swat/jenkins/disni/
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.executor.instances 16
$ bin/spark-submit -v --executor-cores 3 --properties-file /hpc/scrap/users/swat/jenkins/spark/spark2.conf --executor-memory 124G --master spark://clx-orion-011:7077 --class com.github.ehiggs.spark.terasort.TeraSort /hpc/scrap/users/peterr/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /terasort-input-1g /terasort-output-1g
Could you please also try generating a larger dataset? You are running 15 executors for 1 GB of input data (<100 MB per executor). Alternatively, try running with a smaller number of executors to make sure everything works.
Hi,
Thanks for your reply.
I used 4 nodes to start the Spark job, and now the TeraSort job starts and finishes successfully.
However, while the SparkRDMA TeraSort job is running, my Zabbix system sees only TCP traffic, and the RDMA traffic is zero. This is very strange, because according to stderr, ibm.disni was loaded during processing.
Do you use InfiniBand or RoCE? Is your monitoring system configured to monitor RDMA traffic? You can check how to monitor RDMA traffic here: https://community.mellanox.com/docs/DOC-2416
You can also run some IB perf tests from the perftest package and make sure Zabbix captures the RDMA traffic.
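Independently of Zabbix, one way to see whether a job moves any RDMA traffic is to sample the port counters that the kernel exposes in sysfs before and after an interval. The sketch below assumes the driver provides port_xmit_data and port_rcv_data under /sys/class/infiniband/<device>/ports/<port>/counters, and that these counters are in 4-byte units, as is usual for InfiniBand-style port counters. The function names are invented for this example.

```shell
# Hypothetical helpers: read RDMA port counters from a sysfs counters directory.

# Print one counter converted to bytes.
# Assumption: port_xmit_data / port_rcv_data count 4-byte words.
rdma_counter_bytes() {
    local counters_dir="$1"
    local name="$2"
    local words
    words=$(cat "$counters_dir/$name") || return 1
    echo $((words * 4))
}

# Print the bytes sent and received during an interval (seconds, default 5).
rdma_traffic_delta() {
    local counters_dir="$1"
    local interval="${2:-5}"
    local tx0 rx0 tx1 rx1
    tx0=$(rdma_counter_bytes "$counters_dir" port_xmit_data) || return 1
    rx0=$(rdma_counter_bytes "$counters_dir" port_rcv_data) || return 1
    sleep "$interval"
    tx1=$(rdma_counter_bytes "$counters_dir" port_xmit_data) || return 1
    rx1=$(rdma_counter_bytes "$counters_dir" port_rcv_data) || return 1
    echo "tx_bytes=$((tx1 - tx0)) rx_bytes=$((rx1 - rx0))"
}
```

For example, while TeraSort is running: rdma_traffic_delta /sys/class/infiniband/mlx5_0/ports/1/counters 10 (the device name mlx5_0 and port 1 are placeholders; use the device and port of your own adapter). If both deltas stay at zero during the shuffle phase, the shuffle is probably not going over RDMA.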
Hi,
I use a RoCE network. I ran some OFED perftest jobs, and Zabbix can see the RDMA traffic.
What is your command to run SparkRDMA TeraSort, please? I don't know whether there is some difference in the commands used to run the SparkRDMA TeraSort program.
Here's how I run the ehiggs TeraSort version with SparkRDMA:
Teragen 1g of data:
spark/bin/spark-submit -v --num-executors 10 --executor-cores 20 --executor-memory 24G --master yarn --class com.github.ehiggs.spark.terasort.TeraGen /hpc/scrap/users/peterr/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 1g /terasort-input-1g
Run terasort:
$ cat /hpc/scrap/users/swat/jenkins/spark/spark2.conf
spark.driver.extraJavaOptions -Djava.library.path=/hpc/scrap/users/swat/jenkins/disni/
spark.executor.extraClassPath /hpc/scrap/users/swat/jenkins//spark_rdma_artifacts/spark-rdma-2.0-for-spark-2.1.0-jar-with-dependencies.jar
spark.driver.extraClassPath /hpc/scrap/users/swat/jenkins//spark_rdma_artifacts/spark-rdma-2.0-for-spark-2.1.0-jar-with-dependencies.jar
spark.executor.extraJavaOptions -Djava.library.path=/hpc/scrap/users/swat/jenkins/disni/
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.executor.instances 16
$ bin/spark-submit -v --executor-cores 3 --properties-file /hpc/scrap/users/swat/jenkins/spark/spark2.conf --executor-memory 124G --master spark://clx-orion-011:7077 --class com.github.ehiggs.spark.terasort.TeraSort /hpc/scrap/users/peterr/spark-terasort/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /terasort-input-1g /terasort-output-1g
You can find how to run the HiBench TeraSort version here, but the approach is the same. Basically, you need to set spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager.
Hi,
I think I found the problem. They set up PFC for the RoCE network on the servers and the switch. RoCE traffic goes to the 5th queue, while other traffic, such as TCP, goes to the 0th queue. Do you know how to set up DiSNI so that its traffic goes to the 5th queue?
Best,
DiSNI is just a wrapper over the verbs API. If you set up PFC so that RDMA traffic goes to the 5th queue, it will go there. We've updated our wiki documentation; you can check "Advanced forms of flowcontrol".
Let me know if you have questions. BTW, we've released a new version of SparkRDMA. Give it a try ;) It has several performance improvements, bug fixes, more verbose error messages, etc.
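Our problem is that libdisni can never be found:
19/08/28 16:57:44 ERROR RdmaNode: libdisni not found! It must be installed within the java.library.path on each Executor and Driver instance
We have checked the configuration repeatedly and cannot find the problem. Could you give us the detailed steps you followed for the installation?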
@RummySugar You need to install libdisni.so on each server, or upload it with spark-submit.
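A quick way to verify the installation on a given machine is to check that libdisni.so really exists in one of the directories passed as -Djava.library.path. The following is only a sketch: find_libdisni is a made-up helper name, and it takes the library path as a colon-separated string.

```shell
# Hypothetical helper: print the first directory in a colon-separated
# library path that contains libdisni.so; return non-zero if none does.
find_libdisni() {
    local lib_path="$1"
    local dir
    local old_ifs="$IFS"
    IFS=:
    # Unquoted on purpose so the path is split on ":".
    for dir in $lib_path; do
        if [ -f "$dir/libdisni.so" ]; then
            IFS="$old_ifs"
            echo "$dir"
            return 0
        fi
    done
    IFS="$old_ifs"
    return 1
}
```

Run it on the driver and on every executor host with the same value that appears in spark.driver.extraJavaOptions and spark.executor.extraJavaOptions, for example: find_libdisni /hpc/scrap/users/swat/jenkins/disni/ || echo "libdisni.so missing on $(hostname)"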