Comments (6)

witgo commented on August 10, 2024

@benmccann Thanks. I will keep track of this bug and post progress here.

from zen.

witgo commented on August 10, 2024

@benmccann
Can #63 fix this bug?

benmccann commented on August 10, 2024

Nope, I'm still getting the error. Thank you for the suggestion.

witgo commented on August 10, 2024

We do not have enough machines to test this bug properly; we only have four (160 cores, 1 TB memory).
In my tests I did not hit the error.

I used the following conf/spark-defaults.conf:

spark.master = yarn-client
spark.driver.memory = 30g
spark.executor.cores = 4
spark.executor.instances = 24
spark.executor.memory = 20g

Test data: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdda.bz2

Test code:

  import com.github.cloudml.zen.ml.recommendation.{FMModel, FMClassification}
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.mllib.util.MLUtils

  val storageLevel = StorageLevel.MEMORY_AND_DISK
  val data = MLUtils.loadLibSVMFile(sc, "/witgo/kddb").repartition(96)
  // Key each example by a unique id, then split 90/10 into train and test sets.
  val Array(trainSet, testSet) = data.zipWithUniqueId().map(_.swap).randomSplit(Array(0.9, 0.1))
  // count() forces both sets to be materialized and cached before training.
  trainSet.persist(storageLevel).count()
  testSet.persist(storageLevel).count()

  val numIterations = 100
  val stepSize = 0.1
  val l2 = (0.01, 0.01, 0.01)
  val rank = 32
  val useAdaGrad = true
  val lfm = new FMClassification(trainSet, stepSize, l2, rank, useAdaGrad, 1.0, storageLevel)

  // Train in chunks of at most 50 iterations and evaluate on the test set after each chunk.
  var iter = 0
  var model: FMModel = null
  while (iter < numIterations) {
    val thisItr = math.min(50, numIterations - iter)
    iter += thisItr
    // Release the previous snapshot's factors before training further.
    if (model != null) model.factors.unpersist(false)
    lfm.run(thisItr)
    model = lfm.saveModel()
    model.factors.persist(storageLevel)
    model.factors.count()
    val auc = model.loss(testSet)
    println(f"(Iteration $iter/$numIterations) Test AUC:                     $auc%1.6f")
  }

benmccann commented on August 10, 2024

I'm not sure if it's 100 machines or just 100 executors that are required to be able to reproduce the bug. You may have more luck reproducing with:

spark.executor.cores = 1
spark.executor.instances = 120

You could probably run on AWS EMR to reproduce as well.
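
For reference, here is a rough sketch of what the earlier conf/spark-defaults.conf might look like with these two values swapped in. Everything other than the two suggested lines is carried over from the previous comment as an assumption, not a tested configuration. Note that 120 executors at 20g each would request about 2400g, so the executor memory would probably have to be lowered if the 1T mentioned earlier is the cluster total.

```properties
# Sketch only: the earlier settings with the two suggested overrides applied.
spark.master = yarn-client
spark.driver.memory = 30g
# Suggested above: many single-core executors instead of 24 four-core ones.
spark.executor.cores = 1
spark.executor.instances = 120
# Assumption: carried over unchanged. 120 x 20g = 2400g, which likely exceeds
# the available memory, so a smaller value is probably needed.
spark.executor.memory = 20g
```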

witgo commented on August 10, 2024

I used 120 executors and still cannot reproduce the bug. It seems to be caused by something else.
