jadianes / spark-py-notebooks

1.6K 98.0 910.0 2.26 MB

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Home Page: http://jadianes.github.io/spark-py-notebooks

License: Other

Jupyter Notebook 100.00%

spark python pyspark data-analysis mllib ipython-notebook notebook ipython data-science machine-learning

spark-py-notebooks's Issues

urllib module in nb1-rdd-creation

For Python 3.x users, the urllib module has been split into several modules, so I think
from urllib.request import urlretrieve would make more sense.
Please consider updating the notebook if you think it is needed.
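
A minimal sketch of an import that should work on both Python 2 and Python 3, assuming the notebook only needs urlretrieve. The helper name download is hypothetical and is not part of the notebook:

```python
try:
    from urllib.request import urlretrieve  # Python 3 location
except ImportError:
    from urllib import urlretrieve  # Python 2 location


def download(url, dest):
    """Hypothetical helper: fetch url into the local file dest and return its path."""
    path, _headers = urlretrieve(url, dest)
    return path
```

With this in place, the rest of the notebook can keep calling urlretrieve (or the helper) unchanged, whichever Python version is running.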

[bug] About nb10-sql-dataframes.ipynb (DF.map→RDD.map)

@jadianes
Hello, I'm Hiroyuki.
Nice tutorial, thank you!

In[7]

tcp_interactions_out = tcp_interactions.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print ti_out

But map can only be used on an RDD,
so I think we need to convert tcp_interactions (a DataFrame) to an RDD first.

Here is a sample:

tcp_interactions_out = tcp_interactions.rdd.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print ti_out

What do you think?

Sorry if there is a mistake in my code or my wording (I'm not good at writing English).
Please forgive me if I caused any offense.
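
A small sketch of the same formatting logic pulled out into a named function, so it can be checked without a Spark cluster. The Interaction namedtuple and its sample values are hypothetical stand-ins for Spark Row objects; only the duration and dst_bytes field names come from the code above:

```python
from collections import namedtuple


def format_interaction(row):
    """Format any row-like object exposing .duration and .dst_bytes."""
    return "Duration: {}, Dest. bytes: {}".format(row.duration, row.dst_bytes)


# Hypothetical stand-in for Spark Row objects, with made-up values.
Interaction = namedtuple("Interaction", ["duration", "dst_bytes"])
sample = [Interaction(5057, 0), Interaction(5059, 0)]

lines = [format_interaction(r) for r in sample]
for line in lines:
    print(line)

# With Spark available, the same function would be applied through the RDD:
# tcp_interactions.rdd.map(format_interaction).collect()
```

Going through .rdd keeps the lambda-style code of the notebook intact, which is presumably why the proposed fix is a one-word change.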

Apparent Memory Issues

juyptererror.txt
commandprompt.txt
commandprompterror.txt

Hi, I am a student attempting to learn how to use PySpark/Jupyter to build classification models for large data. I installed PySpark v2.2.1 and Jupyter following the tutorial on Medium by Michael Galarnyk. The installation seemed to go fine and I was able to run your first notebook. However, in the second notebook, nb2-rdd-basics, I had problems with the "collect" code:

from time import time
t0 = time()
head_rows = csv_data.take(100000)
tt = time() - t0
print "Parse completed in {} seconds".format(round(tt,3))
Thinking it was a memory issue, I then launched Jupyter with the command
pyspark --master local[4] --driver-memory 32g --executor-memory 32g
I have attached the Jupyter error and the command prompt output from before and after the error.
Please help: how do I increase the memory available to the kernel?

spark context

I had an issue with this command line:
$ MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="1G" IPYTHON_OPTS="notebook --pylab inline" /home/philippe/Downloads/spark-master/bin/pyspark

The error was: Connection refused: /127.0.0.1:7077

It was resolved with:
$ MASTER=local[4] SPARK_EXECUTOR_MEMORY="1G" IPYTHON_OPTS="notebook --pylab inline" /home/philippe/Downloads/spark-master/bin/pyspark
Maybe you could add a note about this to the README.

Otherwise, great notebooks and a great help. Thank you!
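
A rough sketch of how the choice between the two MASTER values could be automated, assuming a "Connection refused" simply means no standalone master is listening on that port. The function name choose_master and its defaults are hypothetical; only the spark://127.0.0.1:7077 and local[4] values come from the commands above:

```python
import socket


def choose_master(host="127.0.0.1", port=7077, fallback="local[4]", timeout=1.0):
    """Return a spark:// URL if something accepts connections on host:port,
    otherwise return the local fallback master."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "spark://{}:{}".format(host, port)
    except OSError:
        # Covers "connection refused" and timeouts alike.
        return fallback


# The result could then be exported as MASTER before launching pyspark.
print(choose_master())
```

This only checks that the port is open, not that a Spark master is actually behind it, so it is a convenience check rather than a guarantee.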

Logistic Regression with LBFGS in Spark 1.6 and 2.1

@jadianes Nice tutorial on Logistic Regression, thank you.
I ran the tutorial on Spark 1.6.2 and 2.1.0. Both ran fine, and I could reproduce your results exactly in 1.6.2, but I would like to offer the following observation about 2.1.0. In 2.1.0 the process takes about 3 times longer to run and produces a different answer from the one produced by 1.6.2. I thought this was strange and found, in the list of Spark tasks, that 2.1.0 was calling a non-LBFGS algorithm. I raised this in a JIRA question (https://issues.apache.org/jira/browse/SPARK-16768). It seems that even though a user can import the LBFGS version into pyspark, call help on it, and actually call it, I don't think it actually runs LBFGS.
http://spark.apache.org/docs/latest/mllib-optimization.html has some other information on LBFGS in Spark.
Later, when 2.1.0 becomes the standard, your readers may find that they don't get your accuracy results. Or maybe I just missed something. Can anyone confirm my observations?

Question: PySpark MLlib model deployed on Docker, but performance is below expectations

Env: Spark standalone on Docker

Case: a trained PySpark model (random forest) deployed on Docker

Question: When I use gunicorn to start the service (which includes model loading and prediction) and expose it as an API with the Python Flask framework, calling the API is quite slow.

Could I get your help or any suggestions on Spark model deployment? Thanks!

Website isn't working

Thanks for the tutorials!
The website's domain has probably expired, and the .github.io link redirects to that domain too.

Possible solutions:

Renew the domain subscription
Cancel the alias or record that's causing the GitHub page to go to the custom domain

Integrate with k8s

The notebooks aren't loading

license?