ashpy's People

Contributors

dependabot[bot], emanueleghelfi, galeone, ilew, mr-ubik


ashpy's Issues

[BUG] - Measuring performance at the end of each epoch disables training logging

The title says pretty much everything.

If I want to measure the performance of the model only at the end of every epoch, the ClassifierTrainer (for example) lets me set the logging frequency to a value <= 0.

This way, the performance is measured on the validation set at the end of every epoch (and it works).

However, on TensorBoard I only see the validation curves; the training curves aren't displayed anymore.
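
A minimal sketch of the kind of guard that could produce this behaviour (the function and parameter names below are hypothetical, not AshPy's internals): with log_freq <= 0 the training summaries are simply never written, while the per-epoch validation summaries are written elsewhere and still reach TensorBoard.

import tensorflow as tf

def maybe_log_training(step: int, loss: float, log_freq: int) -> None:
    """Write training scalars only every `log_freq` steps (illustrative)."""
    if log_freq > 0 and step % log_freq == 0:
        # never reached when log_freq <= 0, hence no training curves
        tf.summary.scalar("loss", loss, step=step)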

[BUG] - Unable to do model selection on a decreasing metric

Describe the bug
I expect to be able to do model selection using the classifier loss as the metric. Instead, I only find a JSON with this content:

cat ~/log/on/best/loss/loss.json

{
    "loss": "-inf",
    "step": "0"
}

Expected behavior

I should find, inside the logdir, the best folder containing the JSON plus the checkpoint files of the best model w.r.t. the chosen metric.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.


# Imports added for completeness; the exact import paths are assumed from the usage below.
import operator
from pathlib import Path

import tensorflow as tf

import ashpy
from ashpy.losses import ClassifierLoss
from ashpy.models.convolutional.autoencoders import Autoencoder
from ashpy.trainers import ClassifierTrainer


def get_model():
    """Create a new autoencoder tf.keras.Model."""
    autoencoder = Autoencoder(
        (64, 64),
        (4, 4),
        kernel_size=3,
        initial_filters=16,
        filters_cap=64,
        encoding_dimension=50,
        channels=3,
    )

    # encoding, representation = autoencoder(input)
    inputs = tf.keras.layers.Input(shape=(64, 64, 3))
    _, reconstruction = autoencoder(inputs)
    model = tf.keras.Model(inputs=inputs, outputs=reconstruction)
    return model


def _train(dataset: tf.data.Dataset, logdir: Path):
    reconstruction_error = ClassifierLoss(tf.keras.losses.MeanSquaredError())
    autoencoder = get_model()

    ClassifierTrainer(
        model=autoencoder,
        optimizer=tf.optimizers.Adam(1e-4),
        loss=reconstruction_error,
        metrics=[ashpy.metrics.ClassifierLoss(model_selection_operator=operator.lt)],
        logdir=str(logdir),
        epochs=100,
    )(dataset, dataset)

In short, the JSON is initialized incorrectly when the model-selection metric is decreasing: with operator.lt, no freshly computed value can ever be lower than "-inf", so the best model is never saved (the starting value should be "inf").

Precision of comparison when doing model selection

See the following log:

AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[960] loss: 0.04510442167520523
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[970] loss: 0.04019254446029663
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[980] loss: 0.03933567926287651
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[990] loss: 0.03875984624028206
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1000] loss: 0.03433336690068245
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1010] loss: 0.041549138724803925
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1020] loss: 0.040606431663036346
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1030] loss: 0.041963666677474976
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041

As you can see, the metric is constant, yet it gets updated every time model selection is performed: we compare the saved value with the newly computed one, and the two are identical except for the trailing decimal digits (the saved value is the shorter float32 representation).

We should fix the number of digits taken into consideration for the comparison.
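
A minimal sketch of the comparison problem and of one possible fix; the rounding approach and the number of digits are illustrative, not a decided design.

import operator

saved = float("0.49551415")      # the value read back from the JSON
current = 0.4955141544342041     # the same metric, freshly computed

print(operator.gt(current, saved))  # True: a spurious "improvement"

DIGITS = 6  # illustrative precision
print(operator.gt(round(current, DIGITS), round(saved, DIGITS)))  # False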

[BUG] - log folder created locally, although specified outside of cwd

Describe the bug
The title describes the bug.

Expected behavior
If I ask AshPy to use a logdir outside of the cwd, nothing in the cwd should change.
Instead, the JSON of the best model is currently created in the current working directory.

Code to reproduce the issue

# Same snippet as in the previous issue (imports as in that snippet).
def get_model():
    """Create a new autoencoder tf.keras.Model."""
    autoencoder = Autoencoder(
        (64, 64),
        (4, 4),
        kernel_size=3,
        initial_filters=16,
        filters_cap=64,
        encoding_dimension=50,
        channels=3,
    )

    # encoding, representation = autoencoder(input)
    inputs = tf.keras.layers.Input(shape=(64, 64, 3))
    _, reconstruction = autoencoder(inputs)
    model = tf.keras.Model(inputs=inputs, outputs=reconstruction)
    return model


def _train(dataset: tf.data.Dataset, logdir: Path):
    reconstruction_error = ClassifierLoss(tf.keras.losses.MeanSquaredError())
    autoencoder = get_model()

    ClassifierTrainer(
        model=autoencoder,
        optimizer=tf.optimizers.Adam(1e-4),
        loss=reconstruction_error,
        metrics=[ashpy.metrics.ClassifierLoss(model_selection_operator=operator.lt)],
        logdir=str(logdir),
        epochs=100,
    )(dataset, dataset)

[BUG/PERFORMANCE] - Handling of logdir: conflict between metrics and trainer

The logdir can be defined in trainers and in metrics.

The current behaviour is to override the metric's logdir with the trainer's logdir.

However, the logdir parameter of the trainer is optional and has a default value (cwd + "log").

logdir = "mylogdir"

precision = ClassifierMetric(
            metric=tf.keras.metrics.Precision(),
            model_selection_operator=operator.gt,
            logdir=logdir,
)

trainer = ClassifierTrainer(
            model=self.model,
            optimizer=optimizer,
            loss=loss,
            epochs=epochs,
            metrics=[precision], 
            callbacks=callbacks,
)
 trainer(
            self.train_dataset.batch(batch_size).prefetch(1),
            self.validation_dataset.batch(batch_size).prefetch(1),
)

AshPy logs in the directory "log" instead of the directory "mylogdir".

Possible solution: remove logdir from the metric's __init__ and set the logdir from the trainer (a sketch follows below).

[Optional] remove logdir default values.
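
A minimal sketch of the proposed behaviour (class and attribute names are assumptions, not AshPy's actual internals): the trainer becomes the single owner of logdir and propagates it to every metric it receives.

class TrainerSketch:
    """Hypothetical trainer: owns `logdir` and pushes it into every metric."""

    def __init__(self, metrics, logdir="log"):
        self._logdir = logdir
        self._metrics = list(metrics)
        for metric in self._metrics:
            # Metrics would no longer accept `logdir` in __init__;
            # the trainer assigns it here instead.
            metric.logdir = self._logdir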

[BUG/API] - Passing name to Custom ClassifierMetric(s)

System information

  • AshPy version:

Describe the bug
There is no easy way to specify a name for a subclassed custom ClassifierMetric.

Expected behavior
While subclassing ClassifierMetric it should be possible to specify a name for the metric.
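
A minimal sketch of the requested API; the name keyword forwarded to the base class is the feature being asked for (it may not exist yet), and the import path is assumed from the usage elsewhere on this page.

import operator

import tensorflow as tf
from ashpy.metrics import ClassifierMetric


class MyPrecision(ClassifierMetric):
    """Custom metric that should be able to carry its own name."""

    def __init__(self, **kwargs):
        super().__init__(
            metric=tf.keras.metrics.Precision(),
            model_selection_operator=operator.gt,
            name="my_precision",  # requested: forwarded to the base class
            **kwargs,
        )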

Restarting training doesn't work well

[Screenshot: Screenshot_20200311_190512]

As you can see, there is a huge drop in accuracy and a spike in the loss value.
This happened when restoring a previous training: say I trained the model for 10 epochs, the training finished, then I changed the number of epochs to 15 and restarted the training.

[FEATURE] - Need a proper way to restore best model from checkpoint

Right now the only way to correctly restore the checkpoint of the best model is to re-create a trainer and re-instantiate everything, even when there is no need to train the model, just to be able to restore it.

Moreover, if we don't specify certain elements of the trainer, we get warnings like:

WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

Thus, a better way of restoring only the best model from the checkpoint created by the model selection is needed.
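
A possible workaround today, sketched under assumptions (the checkpoint key name "model" and the directory layout are guesses based on the issues above), rather than an official API:

import tensorflow as tf

# Rebuild the same architecture used during training (a trivial stand-in here).
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# Restore only the model from the best-model checkpoint, silencing the
# warnings about optimizer/step variables we deliberately leave out.
checkpoint = tf.train.Checkpoint(model=model)
best_dir = "log/best/loss"  # example path: the best-model checkpoint directory
checkpoint.restore(tf.train.latest_checkpoint(best_dir)).expect_partial()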

[FEATURE] - Logging instead of print

System information

  • AshPy version (you are using): 0.4.0
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state

Currently, we use a combination of print() and tf.print() to handle console output. Unifying them via standardized logging would increase clarity, give the end user control over the displayed messages via a logger, and let developers better control debug, info, and warning messages.
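
A minimal sketch of the proposed direction, using only the standard library (the logger name is an assumption):

import logging

logger = logging.getLogger("ashpy")


def log_step(step: int, loss: float) -> None:
    """Replace print()/tf.print() for end-user facing messages."""
    logger.info("[%d] loss: %f", step, loss)


# End users can then tune verbosity themselves:
# logging.getLogger("ashpy").setLevel(logging.WARNING)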

Will this change the current API? How?

No change to it.

Who will benefit from this feature?

  • End Users
  • Developers

[DOC] - Customizing Ashpy Examples

  • Add documentation and examples for the implementation of custom Metrics.
    • How to convert a tf.keras.metrics.Metric into an AshPy one? (a sketch follows below)
  • Add documentation and examples for the implementation of custom Callbacks.
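
For reference, a sketch of what such an example could look like, based on the usage already shown in the issues above (a tf.keras.metrics.Metric wrapped by ashpy.metrics.ClassifierMetric; the import path is assumed):

import operator

import tensorflow as tf
from ashpy.metrics import ClassifierMetric

# Wrap a Keras metric to obtain an AshPy metric; the model-selection operator
# is optional and only needed when the metric drives model selection.
precision = ClassifierMetric(
    metric=tf.keras.metrics.Precision(),
    model_selection_operator=operator.gt,
)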

[FEATURE] - Add colab examples

Describe the feature and the current behavior/state

Right now there are no "click and go" examples to run on Google Colab, and no notebooks that explain and show how to use AshPy and why it is better or more useful than the pure Keras API or other solutions.

Will this change the current api? How?

No, it doesn't.

Who will benefit from this feature?

Everyone. Examples in the documentation and in the README are nice, but Colab notebooks are more powerful and easier to share (they are also catchier: people love executing cells while learning).

No such file or directory checkpoint_map.json when restoring model from model selection

The title says pretty much everything.

Creating a restorer (in particular a ClassifierRestorer, but that's not important) and passing as checkpoint directory the directory where the checkpoint of the best model was saved causes an error.

Two options:

  1. We have to create the same file also when doing model selection.
  2. We have to make it possible to restore the model from the checkpoint even if the file is not present.

I prefer the first option; a sketch of it follows below.
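
A minimal sketch of option 1; the content of checkpoint_map.json is an assumption here, and the real schema expected by the Restorer may differ.

import json
import os


def save_checkpoint_map(best_dir: str, checkpoint_map: dict) -> None:
    """Write checkpoint_map.json next to the best-model checkpoint files."""
    os.makedirs(best_dir, exist_ok=True)
    with open(os.path.join(best_dir, "checkpoint_map.json"), "w") as fp:
        json.dump(checkpoint_map, fp)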

[BUG] - Restorer does not work with distributed training

Describe the bug
Run examples/gan/pix2pix_facades_multi_gpu.py in a multi-GPU scenario. If you try to restore the training once it has finished, you get an error due to wrong input shapes.
This happens because in a multi-GPU scenario the batch size is scaled by the number of devices.
Simply move the call to build_or_restore after the batch size update (see the sketch below).
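
A sketch of the suggested ordering (variable names are illustrative):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

per_replica_batch_size = 32
# The batch size is scaled by the number of devices first...
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# ...and only then are the models built or restored, so the restored models
# see the input shapes actually used during distributed training:
# trainer.build_or_restore(...)  # call name taken from the issue text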

Expected behavior
The restorer should restore the models.

Code to reproduce the issue
examples/gan/pix2pix_facades_multi_gpu.py

[FEATURE] - Add examples to test suite

System information

  • AshPy version (you are using): 1.0.2
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state
Currently the examples are not tested. We could add them to the test suite, or to a separate Travis stage, to check that every example still works after changes (a sketch follows below).
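
A sketch of one possible approach (pytest, the paths, and the timeout are assumptions): run every example as a smoke test so CI fails when an example breaks.

import pathlib
import subprocess
import sys

import pytest

EXAMPLES = sorted(pathlib.Path("examples").rglob("*.py"))


@pytest.mark.parametrize("example", EXAMPLES, ids=str)
def test_example_runs(example):
    """Each example script must exit with status 0."""
    completed = subprocess.run([sys.executable, str(example)], timeout=1800)
    assert completed.returncode == 0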

Will this change the current api? How?
We need to change the travis.yml configuration.

Who will benefit from this feature?
Everyone.

Model selection JSON is overwritten on train restart

Scenario: define a training process, with a metric used for model selection. Then:

  1. Train a model for N epochs, stop the training, and look at best/file.json -> it contains the expected values.
  2. Restart the training (setting the number of epochs to something greater than N): the file.json is overwritten with the default, zeroed values.

Example:

I trained a model, and I got this JSON with the values:

{
    "AEAccuracy": "0.79296064",
    "step": "8370",
    "positive_threshold": "0.019780229777097702",
    "positive_variance": "4.077112680533901e-05",
    "negative_threshold": "0.0740804448723793",
    "negative_variance": "4.077112680533901e-05"
}

After restarting the training, the file is overwritten and reset to the default values:

{
    "AEAccuracy": "-inf",
    "step": "0",
    "positive_threshold": "0.0",
    "positive_variance": "0.0",
    "negative_threshold": "0.0",
    "negative_variance": "0.0"
}

Thus, the model selection process starts again and will certainly overwrite the previous best model (a sketch of a possible fix follows below).
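
A sketch of a possible fix (function and argument names are illustrative): write the default JSON only when no previous model-selection state exists, so a restarted training keeps competing against the previously saved best value.

import json
import os


def init_model_selection_json(path: str, default: dict) -> dict:
    """Return the saved best values if present, otherwise write the defaults."""
    if os.path.exists(path):
        with open(path) as fp:
            return json.load(fp)  # resume: keep competing against the old best
    with open(path, "w") as fp:
        json.dump(default, fp)  # first run: start from "-inf" / zeros
    return default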

[FEATURE] - Update to Python 3.8

System information

  • AshPy version (you are using): 0.4.0
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state

We have been stuck on Python 3.7 because TensorFlow did not support 3.8.
Python 3.8 should now be supported by TF, so AshPy can make the switch too.
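
A sketch of the packaging side of the change (assuming a setup.py-based build; the repository's actual packaging files may differ):

from setuptools import setup

setup(
    name="ashpy",
    python_requires=">=3.7",
    classifiers=[
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",  # newly supported
    ],
)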

Will this change the current api? How?

It should not create any breaking changes unless we start using 3.8-exclusive features.

Who will benefit from this feature?

All users?

[BUG] - Restorer does not work with conditioned GANs

Describe the bug
Run examples/gans/facades.py, stop the training after some time, and then try to restart it. You get an error because the discriminator expects a list of two tensors.

Expected behavior
The training should restart with models restored correctly.

Code to reproduce the issue
Simply run examples/gans/facades.py.

Other info / logs
Inside build_and_restore we should check the number of discriminator inputs (and also the number of generator inputs); a sketch follows below.
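
A sketch of that check, building on the `len(model.inputs) == 2` pattern that appears in the train-step snippet at the end of this page (function and parameter names are illustrative):

import tensorflow as tf


def dummy_inputs(model: tf.keras.Model, input_shape, condition_shape):
    """Return one dummy tensor, or [input, condition] for conditioned models."""
    if len(model.inputs) == 2:  # conditioned generator/discriminator
        return [tf.zeros((1, *input_shape)), tf.zeros((1, *condition_shape))]
    return tf.zeros((1, *input_shape))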

[FEATURE] - Add a flag for correct scaling of logged TensorBoard images

System information

  • AshPy version (you are using): 0.4.0
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state
Float TensorBoard images should be logged in the [0, 1] range.
Currently we do not provide any automatic handling of this scaling, so if you are training on [-1, 1] data and using our callbacks, the images will be logged without rescaling (see the sketch below).
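
A sketch of the rescaling the flag would apply (a plain affine map from [-1, 1] to [0, 1] before the images reach the TensorBoard writer):

import tensorflow as tf


def rescale_for_tensorboard(images: tf.Tensor) -> tf.Tensor:
    """Map float images from [-1, 1] to [0, 1] before logging them."""
    return (images + 1.0) / 2.0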

Will this change the current API? How?
Small change to ashpy.utils.log() and all the callbacks using it: they will accept a new boolean argument, auto_tb_scaling, controlling this behavior.

Who will benefit from this feature?
Anyone using the default logging facilities.


[BUG] - Contexts are not checkpointable; no optimizer in Context; global_step initialization not OK

The file src/ashpy/contexts/context.py contains the following docstring:

r"""
 Primitive Context Interface.
 
 ``Contexts`` are checkpointable (subclassed from :py:class:`tf.train.Checkpoint`)
 collections of variable encapsulated in a Python Class as a way to seamlessly
 handle information transfer.
 """

but the Context class inherits from object, not from tf.train.Checkpoint.

Also, there is no way to access the optimizer from the Context object.

Moreover, global_step is initialized in the constructor signature with tf.Variable(0, name="global_step", trainable=False, dtype=tf.int64); since tf.Variable is a mutable object, the variable is created once at declaration time. This must be avoided: a context is always created (behind the scenes) from a well-defined context, so global_step can be initialized to None.
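
A sketch of the suggested change (a simplified stand-in, not the real class): accept None and let the concrete context provide the variable, instead of creating a tf.Variable as part of the declaration.

import tensorflow as tf


class ContextSketch:
    """Simplified stand-in for ashpy's Context, not the real implementation."""

    def __init__(self, global_step: tf.Variable = None):
        # No tf.Variable(0, ...) default: a mutable default would be created
        # once, at declaration time, and shared by every instance.
        self._global_step = global_step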

Hi! Why isn't self._discriminator trained here?

    real_x, real_y = real_xy

    if len(self._generator.inputs) == 2:
        g_inputs = [g_inputs, real_y]

    with tf.GradientTape(persistent=True) as tape:
        fake = self._generator(g_inputs, training=True)
        logits_fake = self._discriminator(fake)  # why aren't this line and the next one in the original code?
        logits_real = self._discriminator(real_y)
        d_loss = self._d_loss(
            self._context, fake=fake, real=real_x, condition=real_y, training=True
        )

        g_loss = self._g_loss(
            self._context, fake=fake, real=real_x, condition=real_y, training=True
        )

    # check that we have some trainable_variables
    assert self._generator.trainable_variables
    assert self._discriminator.trainable_variables

    # calculate the gradient
    d_gradients = tape.gradient(d_loss, self._discriminator.trainable_variables)
    g_gradients = tape.gradient(g_loss, self._generator.trainable_variables)
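
For context, a sketch of how such an adversarial train step is typically completed (not necessarily AshPy's exact code; the optimizer attribute names are assumptions): both sets of gradients from the persistent tape are applied, so the discriminator is trained as well.

    # apply the gradients with the respective optimizers (attribute names assumed)
    self._d_optimizer.apply_gradients(
        zip(d_gradients, self._discriminator.trainable_variables)
    )
    self._g_optimizer.apply_gradients(
        zip(g_gradients, self._generator.trainable_variables)
    )
    del tape  # release the persistent GradientTape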
