
dissolve-struct's People

Contributors

alucchi, martinjaggi, tribhuvanesh, tvogels


dissolve-struct's Issues

Solver should work with Spark's Vector, instead of Breeze

Currently, both the

  • featureFn signature
  • solver

operate on Breeze's linalg library. The dependency should be moved to Spark's linalg library, since it contains Spark-specific optimizations and would make the solver independent of the Breeze library.
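A rough sketch of what the changed featureFn signature could look like, assuming Spark's mllib linalg types (the trait name FeatureFn is hypothetical; the current Breeze-based signature appears in the mapper shown further below):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// current (Breeze): featureFn: (Y, X) => breeze.linalg.Vector[Double]
// proposed (Spark): featureFn: (Y, X) => org.apache.spark.mllib.linalg.Vector
trait FeatureFn[X, Y] {
  def apply(y: Y, x: X): Vector
}

// e.g. a sparse joint feature vector of dimension d with two non-zero entries:
// Vectors.sparse(d, Array(3, 17), Array(1.0, 0.5))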

support more use-cases, and document them nicely

usability will be a key aspect in making this project work

pystruct has done a good job documenting some nice use-cases for example.
though as we discussed, we should focus on a smaller number of highly relevant ones. we can try to collect a useful list here.

first maybe let's showcase some nice things we can already do with the current oracles we have, e.g.

  • NLP with chain oracle (chunking / POS tagging / entity linking?)
  • vision with a chain oracle (we already have OCR, but that will not convince the vision community yet)
  • bioinformatics?

add user-friendly diagnostic checks for the provided decoding oracles

we should very soon add simple checks for any candidate oracle provided by the user.
these checks should verify that the oracle has no obvious flaws and is compatible with the feature vector the user has provided. another goal is to verify that it performs the loss-augmentation correctly.

such simple checks can avoid many problems later, such as negative duality gaps etc., and would greatly improve the usability of structured SVMs for all users.

here are some simple checks; i'm sure i missed many other possible ones.

  • pick w = all ones, or minus that, or some random signs, or some random w vector
  • pick a random data example (or do the following for all examples)
    1. decoding (without the loss term) of that example at w must return a label whose inner product is at least as good as zero
    2. decoding (with loss augmentation) must return a label y whose inner product is at least as good as the error term of that y
    3. sample a bunch of random y labels; the answer returned by the oracle should be at least as good as all of these
      (none of these checks should depend on a distributed setting; they can be run on one machine, before the actual optimization)

these are easy to implement (a sketch of one such check follows below). however, depending on whether the user claims to have an exact or an approximate oracle, the warnings should sound a bit different if the above is violated...

as the dissolve library will soon support using several oracles at once, it should also know which ones are (claimed to be) exact.
for example, many diagnostics, such as the structured SVM primal objective or the duality gap, require an exact oracle. if one is not available, the user should be aware of it, or better, we should output a warning.
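A minimal sketch of check 3, assuming a hypothetical helper that is handed the featureFn/lossFn signatures used in the mapper further below, plus a loss-augmented decoding wrapper that takes w directly:

import breeze.linalg.Vector

object OracleSanityCheck {
  def check[X, Y](featureFn: (Y, X) => Vector[Double],
                  lossFn: (Y, Y) => Double,
                  lossAugmentedDecode: (Vector[Double], X, Y) => Y, // assumed wrapper around the user's oracle
                  w: Vector[Double],
                  x: X,
                  yTruth: Y,
                  randomLabels: Seq[Y],
                  oracleClaimedExact: Boolean): Unit = {
    // loss-augmented score of a candidate label
    def score(y: Y): Double = (w dot featureFn(y, x)) + lossFn(yTruth, y)

    val oracleScore = score(lossAugmentedDecode(w, x, yTruth))
    // the oracle's answer must be at least as good as the ground truth and the random candidates
    val violations = (yTruth +: randomLabels).count(y => score(y) > oracleScore + 1e-9)
    if (violations > 0) {
      val msg = s"$violations candidate label(s) score higher than the oracle's answer."
      if (oracleClaimedExact) Console.err.println(s"ERROR: $msg An exact oracle must not do this.")
      else Console.err.println(s"WARNING: $msg The approximate oracle may be too inaccurate.")
    }
  }
}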

Code crashes when it tries to write a file in a non-existent directory

Code crashes when it tries to write a file into a non-existent directory. Shall we create the directory instead?

Exception in thread "main" java.io.FileNotFoundException: ../Results/true_true_Coursera_CoCoa_viterbi_chain_pairwise-1440841761677.csv (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at java.io.FileOutputStream.(FileOutputStream.java:110)
at java.io.FileWriter.(FileWriter.java:63)
at ch.ethz.dalab.dissolve.classification.StructSVMWithDBCFW.trainModel(StructSVMWithDBCFW.scala:27)
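A minimal sketch of the proposed fix, creating the parent directory before opening the writer (where exactly this would live in StructSVMWithDBCFW is an assumption):

import java.io.{File, FileWriter}

def writerCreatingDirs(path: String): FileWriter = {
  val parent = new File(path).getParentFile
  if (parent != null && !parent.exists()) parent.mkdirs() // create e.g. ../Results if it is missing
  new FileWriter(path)
}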

SSGSolver and lossWriterFileName

SSGSolver class outputs loss to lossWriterFileName = "data/debug/ssg-loss.csv". Turns out that the code crashes if the data/debug directory doesn't exist. Either this directory should be created or the path should be changed.

provide oracle for multi-class classification

this is another important use-case we should include before release.
for the popular Crammer-Singer formulation of multi-class classification this is trivial: features are just copied into the block of each class, and the oracle returns the class with the highest score, as far as i remember.

as a demo of this, MNIST would be suitable (could re-use our OCR features/setup)
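A minimal sketch of such an oracle under the Crammer-Singer formulation (object and method names are hypothetical; 0/1 loss assumed):

import breeze.linalg.DenseVector

object MultiClassOracle {
  // phi(x, y): copy the input features into block y of a (numClasses * d)-dimensional vector
  def jointFeature(x: DenseVector[Double], y: Int, numClasses: Int): DenseVector[Double] = {
    val d = x.length
    val phi = DenseVector.zeros[Double](numClasses * d)
    phi(y * d until (y + 1) * d) := x
    phi
  }

  // loss-augmented decoding: argmax_y  w . phi(x, y) + 1{y != yTruth}
  def maxOracle(w: DenseVector[Double], x: DenseVector[Double], yTruth: Int, numClasses: Int): Int =
    (0 until numClasses).maxBy { y =>
      val loss = if (y == yTruth) 0.0 else 1.0
      (w dot jointFeature(x, y, numClasses)) + loss
    }
}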

Change sampling strategy

Right now, by setting solverOptions.sampleFrac, say 0.5, the master node samples 50% of the training data and hands each of the K workers 50/K% of the data.

This does not scale well, since adding more workers forces the user to change the sampling fraction in order to scale.
Instead, each worker should independently sample its share of the data and then perform the computation.
That way, adding more workers results in faster convergence.
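Spark's built-in RDD.sample already draws samples per partition; a hand-rolled sketch of the proposed behaviour (the helper name is hypothetical) could look like:

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// every partition keeps roughly sampleFrac of its own points, independent of the number of workers
def samplePerWorker[T: ClassTag](data: RDD[T], sampleFrac: Double, seed: Long): RDD[T] =
  data.mapPartitionsWithIndex { (partitionId, iter) =>
    val rng = new Random(seed + partitionId)
    iter.filter(_ => rng.nextDouble() < sampleFrac)
  }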

support sparse feature vectors

this is higher priority than allowing sparse data X (the latter we can keep very generic in the interface anyway)

Image Segmentation application

Create a demonstration of Dissolve in the context of Image Segmentation, beginning with the MSRC dataset.

The plan

(figure: Image Segmentation plan)

Components

  1. Loading images
  2. Feature Function
  3. Loss Function
  4. Oracle Function using Factorie

Refactor communication between rounds

The intermediate data passed between the mapper and reducer is clunky:

def mapper(dataIterator: Iterator[(Index, (LabeledObject[X, Y], PrimalInfo, Option[BoundedCacheList[Y]]))],
             localModel: StructSVMModel[X, Y],
             featureFn: (Y, X) => Vector[Double], // (y, x) => FeatureVect, 
             lossFn: (Y, Y) => Double, // (yTruth, yPredict) => LossVal, 
             oracleFn: (StructSVMModel[X, Y], Y, X) => Y, // (model, y_i, x_i) => Lab, 
             predictFn: (StructSVMModel[X, Y], X) => Y,
             solverOptions: SolverOptions[X, Y],
             averagedPrimalInfo: PrimalInfo,
             iterCount: Int,
             miniBatchEnabled: Boolean): Iterator[(StructSVMModel[X, Y], Array[(Index, (PrimalInfo, Option[BoundedCacheList[Y]]))], StructSVMModel[X, Y], PrimalInfo, Int // weighted average model
             )]

def reducer( // sc: SparkContext,
    zippedModels: RDD[(StructSVMModel[X, Y], Array[(Index, (PrimalInfo, Option[BoundedCacheList[Y]]))], StructSVMModel[X, Y], PrimalInfo, Int)], // The optimize step returns k blocks. Each block contains (\Delta LocalModel, [\Delta PrimalInfo_i]).
    oldPrimalInfo: RDD[(Index, PrimalInfo)],
    oldCache: RDD[(Index, BoundedCacheList[Y])],
    oldGlobalModel: StructSVMModel[X, Y],
    oldWeightedAveragePrimals: PrimalInfo,
    d: Int,
    beta: Double): (StructSVMModel[X, Y], RDD[(Index, PrimalInfo)], RDD[(Index, BoundedCacheList[Y])], PrimalInfo, Int)

Consider refactoring to use intermediate case classes for transferring data between communication rounds, as sketched below.
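A rough sketch of what such case classes could look like (the names are hypothetical; Index, PrimalInfo, BoundedCacheList and StructSVMModel are the types used in the signatures above):

case class LocalState[Y](index: Index,
                         primalInfo: PrimalInfo,
                         cache: Option[BoundedCacheList[Y]])

case class RoundResult[X, Y](deltaLocalModel: StructSVMModel[X, Y],
                             deltaLocalStates: Array[LocalState[Y]],
                             localModel: StructSVMModel[X, Y],
                             weightedAveragePrimal: PrimalInfo,
                             processedCount: Int)

// the mapper would then return Iterator[RoundResult[X, Y]] and the reducer would
// consume RDD[RoundResult[X, Y]] instead of the tuple types above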

Logging

Logging is currently done through:

  1. Dumping statements to stdout
  2. Stitching together a debugString with debug data during execution.

All of this should be moved to Apache log4j, similar to what MLlib does.
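A minimal sketch of what this could look like (the logger name and the call site are assumptions):

import org.apache.log4j.Logger

object SolverLogging {
  val log: Logger = Logger.getLogger("ch.ethz.dalab.dissolve")
}

// instead of println / debugString concatenation inside the solver:
// SolverLogging.log.info(s"round=$roundNum dualityGap=$gap")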

Move ChainDemo train order to unit test

ChainDemo depends on file "chain_train.csv" to read a specific order of the training examples.
Instead, move this to a unit test and let ChainDemo decide its own order.

benchmark for binary SVM

just to make sure we are not incurring big overheads in the code. the figures should be comparable to the experiments in the CoCoA paper (which uses a binary SVM), also when the number of machines is increased.

e.g. check again whether checkpointing the RDDs hurts or helps performance...

debug info: don't output NaN

if no duality gap is computed, those columns should not be printed out.

maybe the debug flag could be explained more clearly to the user (also the number of rounds between debug computations).

Move demo package into DBCFWStruct

dissolve-struct currently consists of two packages:

  1. The actual Solver code
  2. A separate package containing various examples

Merge these two into a single package and modify the sbt build to build with/without the demos.
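One possible setup in build.sbt, sketched below (the sub-project names are assumptions based on the existing directory layout):

lazy val lib = (project in file("dissolve-struct-lib"))

lazy val examples = (project in file("dissolve-struct-examples"))
  .dependsOn(lib)

lazy val root = (project in file("."))
  .aggregate(lib, examples)

// build only the solver:           sbt "project lib" package
// build the solver plus the demos: sbt "project examples" package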

avoid dense vector operations in local solver

for good performance on sparse data, we shouldn't use any dense vector operation in a single coordinate update. such an update should only cost on the order of the number of non-zeros of the feature vector returned by the oracle, which can be very sparse.

e.g. we should remove the dense temp variables here:
https://github.com/dalab/dissolve-struct/blob/master/dissolve-struct-lib/src/main/scala/ch/ethz/dalab/dissolve/optimization/DBCFWSolverTuned.scala#L961

we can check how that affects performance on a large sparse binarySVM dataset.
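A minimal sketch of a block-coordinate update that touches only the non-zeros of the sparse vectors involved (the variable names are assumptions, not the ones used in DBCFWSolverTuned):

import breeze.linalg.{DenseVector, SparseVector}

// w := w - gamma * wOld_i + gamma * wNew_i, done sparsely via the active entries only
def sparseBlockUpdate(w: DenseVector[Double],
                      wOld_i: SparseVector[Double],
                      wNew_i: SparseVector[Double],
                      gamma: Double): Unit = {
  wOld_i.activeIterator.foreach { case (idx, v) => w(idx) -= gamma * v }
  wNew_i.activeIterator.foreach { case (idx, v) => w(idx) += gamma * v }
}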

Move from breeze's vectors to Spark's vectors

Spark internally uses the Breeze library for its linear algebra operations.
Use Spark's own operations and structures instead of importing Breeze directly.

Also, the feature map requires the output to be a breeze Vector. Change this to Spark's Vector.

allow approximate decoding oracles

allow the oracle to return crappy labels (e.g. test by intentionally messing up the answer)

possible criterion for a good enough answer (see also cache, issue #1)
gamma >= 0.5 * gamma_fix

in the longer term, our interface for the oracles should allow us, i.e. the solver, to call the oracle again with a flag requiring higher accuracy (maybe higher and higher, etc.). if the resulting gamma is never positive, then we stop optimizing and report that the oracle was not good enough.

some of the optimization based oracles in pystruct have this functionality as well:
pystruct.github.io

cache for oracle answers (labels)

Implement a cache for oracle answers, which can be switched on/off as a solver option. this should be invisible to the user writing the oracle code, and work for all types of labels.

for each datapoint, the cache (if enabled) returns a set of labels (e.g. the previous n oracle answers, with n again being a solver option; 0 = no cache).
our algorithm then

  1. retrieves all labels from the cache
  2. puts them into the gamma line-search formula (this requires computing their feature vectors, so one could even think about caching these instead of the labels).
  3. if the resulting gamma is good (as defined by the same criterion as for approximate oracles, see issue #2, e.g. gamma >= 0.5 * gamma_fix), it performs the update with that label. if not, the oracle is called.

a similar functionality is built into SVMstruct, and helps a lot in practice (though it does not improve the theoretical rate as far as we know).
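A minimal sketch of the cache lookup described above (the helper and its parameters are assumptions, not the shipped API):

// try the cached labels first; fall back to the expensive oracle call
// if none of them clears the step-size criterion gamma >= 0.5 * gammaFix
def decodeWithCache[Y](cachedLabels: Seq[Y],
                       lineSearchGamma: Y => Double, // assumed: computes gamma for a candidate label
                       gammaFix: Double,
                       callOracle: () => Y): Y = {
  val best = if (cachedLabels.nonEmpty) Some(cachedLabels.maxBy(lineSearchGamma)) else None
  best match {
    case Some(y) if lineSearchGamma(y) >= 0.5 * gammaFix => y // cache hit: skip the oracle call
    case _ => callOracle()                                    // cache miss / cache disabled
  }
}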
