
dissolve-struct's People

Contributors

alucchi, martinjaggi, tribhuvanesh, tvogels


dissolve-struct's Issues

Solver should work with Spark's Vector, instead of Breeze

Currently, both the

  • featureFn signature
  • solver

operate on Breeze's linalg library. The dependency should be moved to Spark's linalg library, since it contains Spark-specific optimizations and would make the solver independent of the Breeze library.
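A rough sketch of what the changed featureFn signature could look like, assuming Spark's mllib linalg types (the trait name FeatureFn is hypothetical; the current Breeze-based signature appears in the mapper shown further below):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// current (Breeze): featureFn: (Y, X) => breeze.linalg.Vector[Double]
// proposed (Spark): featureFn: (Y, X) => org.apache.spark.mllib.linalg.Vector
trait FeatureFn[X, Y] {
  def apply(y: Y, x: X): Vector
}

// e.g. a sparse joint feature vector of dimension d with two non-zero entries:
// Vectors.sparse(d, Array(3, 17), Array(1.0, 0.5))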

support more use-cases, and document them nicely

usability will be a key aspect in making this project work

pystruct has done a good job documenting some nice use-cases for example.
though as we discussed, we should focus on a smaller number of highly relevant ones. we can try to collect a useful list here.

first maybe let's showcase some nice things we can already do with the current oracles we have, e.g.

  • NLP with chain oracle (chunking / POS tagging / entity linking?)
  • vision with a chain oracle (we already have OCR, but that will not convince the vision community yet)
  • bioinformatics?

add user-friendly diagnostic checks for the provided decoding oracles

we should very soon add simple checks for any candidate oracle provided by the user.
these checks should verify that the oracle has no obvious flaws and is compatible with the feature vector the user has provided. another goal is to verify that it performs the loss-augmentation correctly.

such simple checks can avoid many problems later, such as negative duality gaps etc., and would greatly improve the usability of structured SVMs for all users.

here are some simple checks; i'm sure i missed many other possible ones.

  • pick w = all ones, or minus that, or some random signs, or some random w vector
  • pick a random data example (or do the following for all examples)
    1. decoding (without the loss term) of that example at w must return a label whose inner product is at least as good as zero
    2. decoding (with loss augmentation) must return a label y whose inner product is at least as good as the error term of that y
    3. sample a bunch of random y labels; the answer returned by the oracle should be at least as good as all of these
      (none of these checks should depend on a distributed setting; they can be run on one machine, before the actual optimization)

these are easy to implement (a sketch of one such check follows below). however, depending on whether the user claims to have an exact or an approximate oracle, the warnings should sound a bit different if the above is violated...

as the dissolve library will soon support using several oracles at once, it should also know which ones are (claimed to be) exact.
for example, many diagnostics, such as the structured SVM primal objective or the duality gap, require an exact oracle. if one is not available, the user should be aware of it, or better, we should output a warning.
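A minimal sketch of check 3, assuming a hypothetical helper that is handed the featureFn/lossFn signatures used in the mapper further below, plus a loss-augmented decoding wrapper that takes w directly:

import breeze.linalg.Vector

object OracleSanityCheck {
  def check[X, Y](featureFn: (Y, X) => Vector[Double],
                  lossFn: (Y, Y) => Double,
                  lossAugmentedDecode: (Vector[Double], X, Y) => Y, // assumed wrapper around the user's oracle
                  w: Vector[Double],
                  x: X,
                  yTruth: Y,
                  randomLabels: Seq[Y],
                  oracleClaimedExact: Boolean): Unit = {
    // loss-augmented score of a candidate label
    def score(y: Y): Double = (w dot featureFn(y, x)) + lossFn(yTruth, y)

    val oracleScore = score(lossAugmentedDecode(w, x, yTruth))
    // the oracle's answer must be at least as good as the ground truth and the random candidates
    val violations = (yTruth +: randomLabels).count(y => score(y) > oracleScore + 1e-9)
    if (violations > 0) {
      val msg = s"$violations candidate label(s) score higher than the oracle's answer."
      if (oracleClaimedExact) Console.err.println(s"ERROR: $msg An exact oracle must not do this.")
      else Console.err.println(s"WARNING: $msg The approximate oracle may be too inaccurate.")
    }
  }
}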

Code crashes when it tries to write a file in a non-existent directory

Code crashes when it tries to write a file into a non-existent directory. Shall we create the directory instead?

Exception in thread "main" java.io.FileNotFoundException: ../Results/true_true_Coursera_CoCoa_viterbi_chain_pairwise-1440841761677.csv (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at java.io.FileOutputStream.(FileOutputStream.java:110)
at java.io.FileWriter.(FileWriter.java:63)
at ch.ethz.dalab.dissolve.classification.StructSVMWithDBCFW.trainModel(StructSVMWithDBCFW.scala:27)
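A minimal sketch of the proposed fix, creating the parent directory before opening the writer (where exactly this would live in StructSVMWithDBCFW is an assumption):

import java.io.{File, FileWriter}

def writerCreatingDirs(path: String): FileWriter = {
  val parent = new File(path).getParentFile
  if (parent != null && !parent.exists()) parent.mkdirs() // create e.g. ../Results if it is missing
  new FileWriter(path)
}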

SSGSolver and lossWriterFileName

SSGSolver class outputs loss to lossWriterFileName = "data/debug/ssg-loss.csv". Turns out that the code crashes if the data/debug directory doesn't exist. Either this directory should be created or the path should be changed.

provide oracle for multi-class classification

this is another important use-case we should include before release.
for the popular Crammer-Singer formulation of multi-class classification this is trivial: features are just copied into the block of each class, and the oracle returns the class with the highest score, as far as i remember.

as a demo of this, MNIST would be suitable (could re-use our OCR features/setup)
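A minimal sketch of such an oracle under the Crammer-Singer formulation (object and method names are hypothetical; 0/1 loss assumed):

import breeze.linalg.DenseVector

object MultiClassOracle {
  // phi(x, y): copy the input features into block y of a (numClasses * d)-dimensional vector
  def jointFeature(x: DenseVector[Double], y: Int, numClasses: Int): DenseVector[Double] = {
    val d = x.length
    val phi = DenseVector.zeros[Double](numClasses * d)
    phi(y * d until (y + 1) * d) := x
    phi
  }

  // loss-augmented decoding: argmax_y  w . phi(x, y) + 1{y != yTruth}
  def maxOracle(w: DenseVector[Double], x: DenseVector[Double], yTruth: Int, numClasses: Int): Int =
    (0 until numClasses).maxBy { y =>
      val loss = if (y == yTruth) 0.0 else 1.0
      (w dot jointFeature(x, y, numClasses)) + loss
    }
}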

Change sampling strategy

Right now, by setting solverOptions.sampleFrac, say 0.5, the master node samples 50% of the training data and hands each of the K workers 50/K% of the data.

This does not scale well, since adding more workers forces the user to change the sampling fraction in order to scale.
Instead, each worker should independently sample its share of the data and then perform the computation.
That way, adding more workers results in faster convergence.
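Spark's built-in RDD.sample already draws samples per partition; a hand-rolled sketch of the proposed behaviour (the helper name is hypothetical) could look like:

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// every partition keeps roughly sampleFrac of its own points, independent of the number of workers
def samplePerWorker[T: ClassTag](data: RDD[T], sampleFrac: Double, seed: Long): RDD[T] =
  data.mapPartitionsWithIndex { (partitionId, iter) =>
    val rng = new Random(seed + partitionId)
    iter.filter(_ => rng.nextDouble() < sampleFrac)
  }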

support sparse feature vectors

this is higher priority than allowing sparse data X (the latter we can keep very generic in the interface anyway)

Image Segmentation application

Create a demonstration of Dissolve in the context of Image Segmentation, beginning with the MSRC dataset.

The plan

(figure: Image Segmentation plan)

Components

  1. Loading images
  2. Feature Function
  3. Loss Function
  4. Oracle Function using Factorie

Refactor communication between rounds

The intermediate data passed between the mapper and reducer is clunky:

def mapper(dataIterator: Iterator[(Index, (LabeledObject[X, Y], PrimalInfo, Option[BoundedCacheList[Y]]))],
             localModel: StructSVMModel[X, Y],
             featureFn: (Y, X) => Vector[Double], // (y, x) => FeatureVect, 
             lossFn: (Y, Y) => Double, // (yTruth, yPredict) => LossVal, 
             oracleFn: (StructSVMModel[X, Y], Y, X) => Y, // (model, y_i, x_i) => Lab, 
             predictFn: (StructSVMModel[X, Y], X) => Y,
             solverOptions: SolverOptions[X, Y],
             averagedPrimalInfo: PrimalInfo,
             iterCount: Int,
             miniBatchEnabled: Boolean): Iterator[(StructSVMModel[X, Y], Array[(Index, (PrimalInfo, Option[BoundedCacheList[Y]]))], StructSVMModel[X, Y], PrimalInfo, Int // weighted average model
             )]

def reducer( // sc: SparkContext,
    zippedModels: RDD[(StructSVMModel[X, Y], Array[(Index, (PrimalInfo, Option[BoundedCacheList[Y]]))], StructSVMModel[X, Y], PrimalInfo, Int)], // The optimize step returns k blocks. Each block contains (\Delta LocalModel, [\Delta PrimalInfo_i]).
    oldPrimalInfo: RDD[(Index, PrimalInfo)],
    oldCache: RDD[(Index, BoundedCacheList[Y])],
    oldGlobalModel: StructSVMModel[X, Y],
    oldWeightedAveragePrimals: PrimalInfo,
    d: Int,
    beta: Double): (StructSVMModel[X, Y], RDD[(Index, PrimalInfo)], RDD[(Index, BoundedCacheList[Y])], PrimalInfo, Int)

Consider refactoring to use intermediate case classes for transferring data between communication rounds, as sketched below.
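A rough sketch of what such case classes could look like (the names are hypothetical; Index, PrimalInfo, BoundedCacheList and StructSVMModel are the types used in the signatures above):

case class LocalState[Y](index: Index,
                         primalInfo: PrimalInfo,
                         cache: Option[BoundedCacheList[Y]])

case class RoundResult[X, Y](deltaLocalModel: StructSVMModel[X, Y],
                             deltaLocalStates: Array[LocalState[Y]],
                             localModel: StructSVMModel[X, Y],
                             weightedAveragePrimal: PrimalInfo,
                             processedCount: Int)

// the mapper would then return Iterator[RoundResult[X, Y]] and the reducer would
// consume RDD[RoundResult[X, Y]] instead of the tuple types above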

Logging

Logging is currently done through:

  1. Dumping statements to stdout
  2. Stitching together a debugString with debug data during execution.

All of this should be moved to Apache log4j, similar to what MLlib does.
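A minimal sketch of what this could look like (the logger name and the call site are assumptions):

import org.apache.log4j.Logger

object SolverLogging {
  val log: Logger = Logger.getLogger("ch.ethz.dalab.dissolve")
}

// instead of println / debugString concatenation inside the solver:
// SolverLogging.log.info(s"round=$roundNum dualityGap=$gap")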

Move ChainDemo train order to unit test

ChainDemo depends on file "chain_train.csv" to read a specific order of the training examples.
Instead, move this to a unit test and let ChainDemo decide its own order.

benchmark for binary SVM

just to make sure we are not incurring big overheads in the code. the figures should be comparable to the experiments in the CoCoA paper (which uses a binary SVM), also when the number of machines is increased.

e.g. check again whether checkpointing the RDDs hurts or helps performance...

debug info: don't output NaN

if no duality gap is computed, those columns should not be printed out.

maybe the debug flag could be explained more clearly to the user (also the number of rounds between debug computations).

Move demo package into DBCFWStruct

dissolve-struct currently consists of two packages:

  1. The actual Solver code
  2. A separate package containing various examples

Merge these two into a single package and modify the sbt build to build with/without the demos.
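One possible setup in build.sbt, sketched below (the sub-project names are assumptions based on the existing directory layout):

lazy val lib = (project in file("dissolve-struct-lib"))

lazy val examples = (project in file("dissolve-struct-examples"))
  .dependsOn(lib)

lazy val root = (project in file("."))
  .aggregate(lib, examples)

// build only the solver:           sbt "project lib" package
// build the solver plus the demos: sbt "project examples" package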

avoid dense vector operations in local solver

for good performance on sparse data, we shouldn't use any dense vector operation in a single coordinate update. such an update should only cost on the order of the number of non-zeros of the feature vector returned by the oracle, which can be very sparse.

e.g. we should remove the dense temp variables here:
https://github.com/dalab/dissolve-struct/blob/master/dissolve-struct-lib/src/main/scala/ch/ethz/dalab/dissolve/optimization/DBCFWSolverTuned.scala#L961

we can check how that affects performance on a large sparse binarySVM dataset.
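A minimal sketch of a block-coordinate update that touches only the non-zeros of the sparse vectors involved (the variable names are assumptions, not the ones used in DBCFWSolverTuned):

import breeze.linalg.{DenseVector, SparseVector}

// w := w - gamma * wOld_i + gamma * wNew_i, done sparsely via the active entries only
def sparseBlockUpdate(w: DenseVector[Double],
                      wOld_i: SparseVector[Double],
                      wNew_i: SparseVector[Double],
                      gamma: Double): Unit = {
  wOld_i.activeIterator.foreach { case (idx, v) => w(idx) -= gamma * v }
  wNew_i.activeIterator.foreach { case (idx, v) => w(idx) += gamma * v }
}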

Move from breeze's vectors to Spark's vectors

Spark internally uses the Breeze library for its linear algebra operations.
Use Spark's own operations and structures instead of importing Breeze directly.

Also, the feature map requires the output to be a breeze Vector. Change this to Spark's Vector.

allow approximate decoding oracles

allow the oracle to return crappy labels (e.g. test by intentionally messing up the answer)

possible criterion for a good enough answer (see also cache, issue #1)
gamma >= 0.5 * gamma_fix

in the longer term, our interface for the oracles should allow us, i.e. the solver, to call the oracle again with a flag requiring higher accuracy (maybe higher and higher, etc.). if the resulting gamma is never positive, then we stop optimizing and report that the oracle was not good enough.

some of the optimization based oracles in pystruct have this functionality as well:
pystruct.github.io

cache for oracle answers (labels)

Implement a cache for oracle answers, which can be switched on/off as a solver option. this should be invisible to the user writing the oracle code, and work for all types of labels.

for each datapoint, the cache (if enabled) returns a set of labels (e.g. the previous n oracle answers, with n again being a solver option; 0 = no cache).
our algorithm then

  1. retrieves all labels from the cache
  2. puts them into the gamma line-search formula (this requires computing their feature vectors, so one could even think about caching these instead of the labels).
  3. if the resulting gamma is good (as defined by the same criterion as for approximate oracles, see issue #2, e.g. gamma >= 0.5 * gamma_fix), it performs the update with that label. if not, the oracle is called.

a similar functionality is built into SVMstruct, and helps a lot in practice (though it does not improve the theoretical rate as far as we know).
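A minimal sketch of the cache lookup described above (the helper and its parameters are assumptions, not the shipped API):

// try the cached labels first; fall back to the expensive oracle call
// if none of them clears the step-size criterion gamma >= 0.5 * gammaFix
def decodeWithCache[Y](cachedLabels: Seq[Y],
                       lineSearchGamma: Y => Double, // assumed: computes gamma for a candidate label
                       gammaFix: Double,
                       callOracle: () => Y): Y = {
  val best = if (cachedLabels.nonEmpty) Some(cachedLabels.maxBy(lineSearchGamma)) else None
  best match {
    case Some(y) if lineSearchGamma(y) >= 0.5 * gammaFix => y // cache hit: skip the oracle call
    case _ => callOracle()                                    // cache miss / cache disabled
  }
}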
