spark-in-action / first-edition

The book's repo

Home Page: https://www.manning.com/books/spark-in-action

Shell 0.30% Scala 36.98% Python 27.29% Java 31.67% HTML 0.92% JavaScript 2.84%

first-edition's People

guzu92 miguelperalvo gnanasundar ganeshchand krishnak7 alucarrd lucentcosmos datagur is00hcw defaultrobot nunofernandes-plight ennpet lynn-yqjykn yindafei rahul-c1 svishnu88 abdheshkumar sivasubbu agilemobiledev eprtvea gachet krishnatray pjel ndjidoardo akirakane an100 mwasa alexandregz plp6koff wuatanabe giserh thimotyb harlixxy meichangsu1 747905245 atayebali gutiankai qicst23 xc35 adicostil honghongw jeperez elinok vishal2232 chrisxin kamalakar-bigdata xuspi xusliebana vishallama joao-parana szokebarnabas sambitkumohanty183 trahasch jjdelrom ppilla1 ykushch davbzh datumsays repocastle dalamar66 azmikamis ynajib ankitsindhu williamdemeo rohitk77 torypages youngwookim ram1991 sensaid anjijava16 deepak-sharma1804 preetilodha gregoirew e-bertrand jaydenwhyte mnozary mparaz hailingc gewmas mneelam fighting-dreamer dgshaver khristinyork jesusnietos gskreddy2432 kouichiume francismanzanilla mrenau kanghuawu aminesagaama latuji abhynavb ylobin stefmt2970 tao-cao coronate-zz ewalsh sffej sridevibaskaran ethanbo0927

first-edition's Issues

Discretized stream - number of buy or sell orders per second?

The task is to count the number of buy and sell orders per second.

The code example does not take the timestamp into account at all. How can we know that the orders being reduced actually fall within one second? My understanding was that a mini-batch is produced every 3 seconds.

Honestly, I am quite confused. When we say "the number of buy and sell orders per second", what exactly do we mean? Do we mean orders whose timestamps fall within the same one-second interval, or do we mean whatever arrives per second, independently of the timestamp?

Can this at least be clarified?
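The two readings in the question can be contrasted with a minimal sketch in plain Scala collections, without Spark. It is not the book's code: the simplified `Order` class and the helper names `countPerBatch` and `countPerEventSecond` are hypothetical, and which reading the book intends is not settled here. The first helper counts everything that landed in one mini-batch regardless of timestamp; the second keys each order by the second its timestamp falls in.

```scala
import java.sql.Timestamp

object PerSecondCounts {
  // Hypothetical, simplified order: only the fields needed for counting
  case class Order(time: Timestamp, buy: Boolean)

  // Reading 1: count buys and sells in whatever arrived in one mini-batch,
  // ignoring the timestamp entirely
  def countPerBatch(batch: Seq[Order]): Map[Boolean, Long] =
    batch.groupBy(_.buy).map { case (buy, os) => buy -> os.size.toLong }

  // Reading 2: count buys and sells per second of the order's own timestamp
  // (epoch seconds are used as the key)
  def countPerEventSecond(batch: Seq[Order]): Map[(Long, Boolean), Long] =
    batch
      .groupBy(o => (o.time.getTime / 1000, o.buy))
      .map { case (key, os) => key -> os.size.toLong }
}
```

With the second reading, orders from the same mini-batch can end up under different keys, which is exactly the distinction the question is about.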

Dashboard sources incorrect

Hi,

The source archives (zip and tar.gz) published at https://github.com/spark-in-action/first-edition/releases/ do not contain the sources of the dashboard.

Installation of Vagrant

Section 1.5.1 (Downloading and starting the VM) talks about installing Oracle VirtualBox and Vagrant.
I already have VirtualBox running. Do I need to install Vagrant on my Mac or inside VirtualBox?

Why are the files in ch06output/output-*.txt empty?

```scala
import org.apache.spark._
import org.apache.spark.streaming._
import java.sql.Timestamp
import java.text.SimpleDateFormat

// Mini-batches are produced every 5 seconds
val ssc = new StreamingContext(sc, Seconds(5))

val filestream = ssc.textFileStream("/home/spark/ch06input")

case class Order(time: Timestamp, orderId: Long, clientId: Long, symbol: String, amount: Int, price: Double, buy: Boolean)

// Parse each CSV line into an Order; malformed lines are logged and dropped
val orders = filestream.flatMap(line => {
  // HH parses hours on the 24-hour clock
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val s = line.split(",")
  try {
    assert(s(6) == "B" || s(6) == "S")
    List(Order(new Timestamp(dateFormat.parse(s(0)).getTime()), s(1).toLong, s(2).toLong, s(3), s(4).toInt, s(5).toDouble, s(6) == "B"))
  }
  catch {
    case e: Throwable => println("Wrong line format (" + e + "): " + line)
    List()
  }
})

// Count buy (true) and sell (false) orders within each mini-batch
val numPerType = orders.map(o => (o.buy, 1L)).reduceByKey((c1, c2) => c1 + c2)

// Each batch is written to its own directory: output-<batch time in ms>.txt
numPerType.repartition(1).saveAsTextFiles("/home/spark/ch06output/output", "txt")

ssc.start()
```
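
A side note on the date pattern, as a minimal sketch: in `SimpleDateFormat`, `hh` is the 12-hour clock and `HH` is the 24-hour clock, so the two can disagree on the same input. The helper `hourOfDay` below is hypothetical and exists only to show the difference. It pins the time zone to UTC so the result does not depend on the machine's settings. This is probably unrelated to the empty output, since the counts do not use the timestamp.

```scala
import java.text.SimpleDateFormat
import java.util.{Calendar, TimeZone}

object HourPatternDemo {
  private val utc = TimeZone.getTimeZone("UTC")

  // Parse `text` with `pattern` and return the resulting hour of day (0-23)
  def hourOfDay(pattern: String, text: String): Int = {
    val format = new SimpleDateFormat(pattern)
    format.setTimeZone(utc)
    val cal = Calendar.getInstance(utc)
    cal.setTime(format.parse(text))
    cal.get(Calendar.HOUR_OF_DAY)
  }
}
```

For example, `HourPatternDemo.hourOfDay("yyyy-MM-dd hh:mm:ss", "2016-03-22 12:25:28")` yields 0 (12 on the 12-hour clock defaults to midnight), while the same call with `HH` yields 12.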

I've been following chapter 6 of the book step by step: (1) start `spark-shell --master local[*]`, (2) create the StreamingContext (`ssc`), (3) call `ssc.start()`, (4) execute the shell script, (5) wait for the results. I'm running this in spark-shell (inside the VM, of course), but I can't get any results for this case study: all files in the ch06output/ directory are empty. I don't know why. Can anyone help?
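
One way to narrow this down is to check whether every batch is empty or only some are. `saveAsTextFiles` writes one directory per batch, named from the prefix, the batch time, and the suffix, with `part-*` files inside. The sketch below is a hypothetical diagnostic helper, not code from the book: it sums the size of the part files in each batch directory. The `"output-"` prefix default is an assumption that matches the call in the listing.

```scala
import java.io.File

object OutputCheck {
  // Map each batch directory name to the total bytes of its part-* files.
  // listFiles() returns null for a missing path, hence the Option wrapper.
  def partFileBytes(root: File, prefix: String = "output-"): Map[String, Long] = {
    val batchDirs = Option(root.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isDirectory && f.getName.startsWith(prefix))
    batchDirs.map { dir =>
      val parts = Option(dir.listFiles()).getOrElse(Array.empty[File])
        .filter(_.getName.startsWith("part-"))
      dir.getName -> parts.map(_.length()).sum
    }.toMap
  }
}
```

Calling `OutputCheck.partFileBytes(new File("/home/spark/ch06output"))` from the shell shows which batches, if any, produced data. If some batches are non-empty, the empty ones may simply be intervals in which no new input file arrived.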

Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

While running spark-submit for the real-time dashboard application, I get the following error:
spark@spark-in-action:~/uc1-docker$ ./run-all.sh
Zookeeper already started
Kafka already started
sia-dashboard already running
Submitting Spark job
Starting Kafka direct stream to broker list: 192.168.10.2:9092
Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
at kafka.utils.Pool.<init>(Pool.scala:28)
at kafka.consumer.FetchRequestAndResponseStatsRegistry$.<init>(FetchRequestAndResponseStats.scala:60)
at kafka.consumer.FetchRequestAndResponseStatsRegistry$.<clinit>(FetchRequestAndResponseStats.scala)
at kafka.consumer.SimpleConsumer.<init>(SimpleConsumer.scala:39)
at org.apache.spark.streaming.kafka.KafkaCluster.connect(KafkaCluster.scala:52)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:345)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:342)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at org.apache.spark.streaming.kafka.KafkaCluster.org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers(KafkaCluster.scala:342)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:125)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)
at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
at org.sia.loganalyzer.StreamingLogAnalyzer$.main(StreamingLogAnalyzer.scala:76)
at org.sia.loganalyzer.StreamingLogAnalyzer.main(StreamingLogAnalyzer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 25 more
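
A `NoClassDefFoundError` for a `scala.collection` class often points to a Scala binary version mismatch, for example a Kafka jar built for one Scala version running on a Spark build that uses another. Whether that is the cause here is an assumption. The sketch below is a hypothetical helper, not part of the dashboard project: it compares the Scala suffix that conventionally appears in artifact names (such as `_2.10` or `_2.11`) with the Scala version of the running JVM. The jar names in the usage note are illustrative only.

```scala
object ScalaVersionCheck {
  // "2.11.8" -> "2.11"
  def binaryVersion(full: String): String =
    full.split("\\.").take(2).mkString(".")

  // Extract the Scala suffix from a conventionally named artifact,
  // e.g. "kafka_2.10-0.8.2.1.jar" -> Some("2.10")
  def artifactScalaSuffix(jarName: String): Option[String] =
    "_(\\d+\\.\\d+)-".r.findFirstMatchIn(jarName).map(_.group(1))

  // True when the jar's suffix differs from the running Scala binary version
  def mismatches(jarName: String, runtimeVersion: String): Boolean =
    artifactScalaSuffix(jarName).exists(_ != binaryVersion(runtimeVersion))
}
```

In spark-shell, `ScalaVersionCheck.mismatches("kafka_2.10-0.8.2.1.jar", scala.util.Properties.versionNumberString)` would return true on a Scala 2.11 runtime. Jars without a Scala suffix are reported as not mismatching, because nothing can be inferred from their names.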
