zurutech / ashpy
TensorFlow 2.0 library for distributed training, evaluation, model selection, and fast prototyping.
Home Page: https://ashpy.zurutech.io/
License: Apache License 2.0
Scenario:
The JSON file of the model selection is overwritten with default values, causing you to lose the previously (and correctly) stored values.
https://ashpy.zurutech.io/en/latest/_modules/ashpy/models/gans.html#Generator
The source code page should be made enlargeable in order to see the whole code.
The title says pretty much everything.
If I want to measure the performance of the model only at the end of every epoch, the ClassifierTrainer (for example) lets me set the logging frequency to a value <= 0.
In this way, I measure the performance at the end of every epoch on the validation set (it works).
However, on TensorBoard I only see the plot of the validation curves; the training curves aren't displayed anymore.
Describe the bug
I expect to be able to perform model selection using the classifier loss as the metric. Instead, I find only a JSON file with this content:
cat ~/log/on/best/loss/loss.json
{
"loss": "-inf",
"step": "0"
}
Expected behavior
I should find in the logdir, the best folder containing the JSON + the checkpoint files of the best model, wrt the chosen metric.
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
# (imports added for completeness; exact module paths may differ across AshPy versions)
import operator
from pathlib import Path

import tensorflow as tf

import ashpy
from ashpy.losses import ClassifierLoss
from ashpy.models.convolutional.autoencoders import Autoencoder
from ashpy.trainers import ClassifierTrainer


def get_model():
    """Create a new autoencoder tf.keras.Model."""
    autoencoder = Autoencoder(
        (64, 64),
        (4, 4),
        kernel_size=3,
        initial_filters=16,
        filters_cap=64,
        encoding_dimension=50,
        channels=3,
    )
    # encoding, representation = autoencoder(input)
    inputs = tf.keras.layers.Input(shape=(64, 64, 3))
    _, reconstruction = autoencoder(inputs)
    model = tf.keras.Model(inputs=inputs, outputs=reconstruction)
    return model


def _train(dataset: tf.data.Dataset, logdir: Path):
    reconstruction_error = ClassifierLoss(tf.keras.losses.MeanSquaredError())
    autoencoder = get_model()
    ClassifierTrainer(
        model=autoencoder,
        optimizer=tf.optimizers.Adam(1e-4),
        loss=reconstruction_error,
        metrics=[ashpy.metrics.ClassifierLoss(model_selection_operator=operator.lt)],
        logdir=str(logdir),
        epochs=100,
    )(dataset, dataset)
In short, the initialization of the JSON is wrong when I want to select a model whose metric should decrease.
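A minimal sketch of the fix, assuming the initial "best" value is derived from the model selection operator instead of being hardcoded to -inf (the helper name is illustrative, not AshPy API):

import operator

def initial_best_value(model_selection_operator) -> float:
    # operator.gt keeps increasing metrics -> start from -inf;
    # operator.lt keeps decreasing metrics (e.g. a loss) -> start from +inf.
    return float("-inf") if model_selection_operator is operator.gt else float("inf")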
https://ashpy.readthedocs.io/en/latest/_modules/ashpy/models/convolutional/unet.html#UNet
Make it point to https://arxiv.org/abs/1611.07004
See the following log:
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[960] loss: 0.04510442167520523
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[970] loss: 0.04019254446029663
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[980] loss: 0.03933567926287651
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[990] loss: 0.03875984624028206
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1000] loss: 0.03433336690068245
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1010] loss: 0.041549138724803925
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1020] loss: 0.040606431663036346
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1030] loss: 0.041963666677474976
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
As you can see, this is a constant metric that nevertheless triggers model selection at every validation, because we compare the saved value with the newly computed one, and the two are identical except for the number of decimal digits.
We should fix the number of digits we want to take into consideration.
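A hedged sketch of such a fixed-precision comparison, assuming we settle on 5 decimal digits (the helper below is illustrative, not AshPy API):

def is_new_best(new_value: float, best_value: float, op, ndigits: int = 5) -> bool:
    # 0.49551415 and 0.4955141544342041 both round to 0.49551, so the
    # constant AEAccuracy above would no longer trigger model selection.
    return op(round(new_value, ndigits), round(best_value, ndigits))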
The default value for the processing_predictions argument of ashpy.metrics.ClassifierMetric leads to some issues when kept at its default while working with metrics such as FBetaScore, Precision, and Recall.
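As a hedged workaround sketch, the default (which reportedly applies tf.argmax) can be overridden with a processing function suited to binary metrics; the {"fn": ..., "kwargs": ...} format mirrors the ClassifierMetric argument, but double-check the signature for your AshPy version:

import operator
import tensorflow as tf
from ashpy.metrics import ClassifierMetric

precision = ClassifierMetric(
    metric=tf.keras.metrics.Precision(),
    model_selection_operator=operator.gt,
    # round sigmoid outputs to {0, 1} instead of taking an argmax
    processing_predictions={"fn": tf.round, "kwargs": {}},
)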
StrategyBase.experimental_run_v2 (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
renamed to `run`
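The migration is mechanical; a small sketch of the rename (the strategy and step function are placeholders):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

@tf.function
def step(x):
    return x * 2  # placeholder per-replica computation

# Deprecated: strategy.experimental_run_v2(step, args=(tf.constant(1.0),))
# Replacement (TF >= 2.2):
result = strategy.run(step, args=(tf.constant(1.0),))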
Describe the bug
The title describes the bug.
Expected behavior
If I ask ashpy to use a logdir outside of the cwd, nothing in the cwd should change.
Instead, the JSON of the best model is currently created in the current working directory.
Code to reproduce the issue
# (imports added for completeness; exact module paths may differ across AshPy versions)
import operator
from pathlib import Path

import tensorflow as tf

import ashpy
from ashpy.losses import ClassifierLoss
from ashpy.models.convolutional.autoencoders import Autoencoder
from ashpy.trainers import ClassifierTrainer


def get_model():
    """Create a new autoencoder tf.keras.Model."""
    autoencoder = Autoencoder(
        (64, 64),
        (4, 4),
        kernel_size=3,
        initial_filters=16,
        filters_cap=64,
        encoding_dimension=50,
        channels=3,
    )
    # encoding, representation = autoencoder(input)
    inputs = tf.keras.layers.Input(shape=(64, 64, 3))
    _, reconstruction = autoencoder(inputs)
    model = tf.keras.Model(inputs=inputs, outputs=reconstruction)
    return model


def _train(dataset: tf.data.Dataset, logdir: Path):
    reconstruction_error = ClassifierLoss(tf.keras.losses.MeanSquaredError())
    autoencoder = get_model()
    ClassifierTrainer(
        model=autoencoder,
        optimizer=tf.optimizers.Adam(1e-4),
        loss=reconstruction_error,
        metrics=[ashpy.metrics.ClassifierLoss(model_selection_operator=operator.lt)],
        logdir=str(logdir),
        epochs=100,
    )(dataset, dataset)
The logdir can be defined both in trainers and in metrics.
The current behaviour is to override the metric's logdir with the trainer's logdir.
However, the logdir parameter of the trainer is optional and has a default value (cwd + "log").
logdir = "mylogdir"
precision = ClassifierMetric(
metric=tf.keras.metrics.Precision(),
model_selection_operator=operator.gt,
logdir=logdir,
)
trainer = ClassifierTrainer(
model=self.model,
optimizer=optimizer,
loss=loss,
epochs=epochs,
metrics=[precision],
callbacks=callbacks,
)
trainer(
self.train_dataset.batch(batch_size).prefetch(1),
self.validation_dataset.batch(batch_size).prefetch(1),
)
AshPy logs in the directory "log" instead of the directory "mylogdir".
Possible solution: remove logdir from the metric's __init__ and set the logdir from the trainer.
[Optional] remove the logdir default values.
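A minimal sketch of that solution (class and attribute names are illustrative, not the actual AshPy internals):

class Trainer:
    def __init__(self, metrics, logdir="log"):
        self._logdir = logdir
        self._metrics = metrics
        # the trainer becomes the single source of truth for the logdir
        for metric in self._metrics:
            metric.logdir = self._logdir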
If anyone finds a typo or any small error in the docs, please report it here. Thanks ❤️
Badges should link not to the image but to something relevant: e.g., clicking on the package: ashpy badge should take you to the ashpy PyPI page.
Describe the bug
No easy way to specify a name for a subclassed custom ClassifierMetric.
Expected behavior
While subclassing ClassifierMetric, it should be possible to specify a name for the metric.
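A hedged sketch of the desired API, where name= is the proposed addition rather than an existing parameter:

import operator
import tensorflow as tf
from ashpy.metrics import ClassifierMetric

class AEAccuracy(ClassifierMetric):
    def __init__(self):
        super().__init__(
            metric=tf.keras.metrics.BinaryAccuracy(),
            model_selection_operator=operator.gt,
            name="AEAccuracy",  # proposed parameter, not yet supported
        )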
Add, either to the docs or to a separate wiki, a list of projects (but also articles, tutorials, etc.) using our library.
https://ashpy.zurutech.io/en/latest/index.html
The logo can be added to the sidebar, like in the Keras docs: https://keras.io/
Right now the only way to correctly restore the content of the checkpoint of the best model is to re-create a trainer and, even if there is no need to train the model, re-instantiate everything in order to be able to restore it.
Moreover, if we don't specify certain elements of the trainer, we get warnings like:
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
Thus, a better way of restoring only the best model from the checkpoint (created during model selection) is needed.
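A lighter-weight alternative sketched with plain TensorFlow, assuming the best checkpoint lives under best_dir and only the model weights are needed (the checkpoint key model= may differ from what AshPy actually writes):

import tensorflow as tf

def restore_best(model: tf.keras.Model, best_dir: str) -> tf.keras.Model:
    checkpoint = tf.train.Checkpoint(model=model)
    status = checkpoint.restore(tf.train.latest_checkpoint(best_dir))
    # silence the warnings about optimizer/trainer-only variables
    status.expect_partial()
    return model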
Line 266 in f053532:
The loss is calculated as -tf.nn.relu(d_fake).
However, as in SA-GAN:
https://github.com/brain-research/self-attention-gan/blob/ad9612e60f6ba2b5ad3d3340ebae60f724636d75/model.py#L73
we should only minimize -d_fake.
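For reference, the SA-GAN generator hinge loss boils down to:

import tensorflow as tf

def generator_hinge_loss(d_fake: tf.Tensor) -> tf.Tensor:
    # no relu on the generator side: just minimize -E[D(G(z))]
    return -tf.reduce_mean(d_fake)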
https://ashpy.readthedocs.io/en/latest/_autosummary/models/ashpy.models.gans.html#
No documentation; the links to the various GAN components are broken.
Lines 56 to 59 in 7c44d35
Have these been solved? If not, maybe we should ping the TF team again.
System information
0.4.0
Describe the feature and the current behavior/state
Currently, we use a combination of print() and tf.print() to handle console output. Unifying them via standardized logging would increase clarity and give the end user control over the displayed messages via a logger, while letting developers have better control over debug, info, and warning messages.
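A minimal sketch, assuming we standardize on the stdlib logging module with a library-level logger (handler and format choices are illustrative):

import logging

logger = logging.getLogger("ashpy")

def log_step(step: int, loss: float) -> None:
    # replaces ad-hoc print()/tf.print() calls; end users can silence or
    # redirect these via logging.getLogger("ashpy").setLevel(...)
    logger.info("[%d] loss: %f", step, loss)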
Will this change the current API? How?
No change to it.
Who will benefit from this feature?
I think we can set up Travis to automate tests and deployment. Travis can be used for automatic deployment to PyPI in this way: https://docs.travis-ci.com/user/deployment/pypi/.
How to turn a keras.metrics.Metric into an AshPy Metric? The same question applies to Callbacks.
Describe the feature and the current behavior/state
Right now there are no "click and go" examples to run on Google Colab, and no notebooks that explain and show how to use AshPy and why it should be better or more useful than the pure Keras API or other solutions.
Will this change the current API? How?
No, it doesn't.
Who will benefit from this feature?
Everyone. Examples in the documentation and in the README are nice, but Colab notebooks are more powerful and easier to share (moreover, they are catchier: people love executing cells while learning).
The title says pretty much everything.
Creating a restorer (in particular a ClassifierRestorer, but that is not important) and passing as checkpoint directory the directory where the checkpoint of the best model has been saved causes an error.
Two options:
I like the first one.
Describe the bug
Run examples/gan/pix2pix_facades_multi_gpu.py in a multi-GPU scenario. If you try to restore the training once it has finished, you get an error due to wrong input shapes.
This is because in a multi-GPU scenario the batch size gets updated based on the number of devices.
Simply move the call to build_or_restore after the batch size update.
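Sketched in terms of the issue's own names (the exact code in the example may differ):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
batch_size_per_replica = 4  # illustrative value

# 1. Update the global batch size for the multi-GPU setting first.
batch_size = batch_size_per_replica * strategy.num_replicas_in_sync

# 2. Only then build or restore the models, so the restored input shapes
#    match the actual per-step batch:
# build_or_restore(models, checkpoint_dir)  # must come after the update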
Expected behavior
The restorer should restore the models.
Code to reproduce the issue
examples/gan/pix2pix_facades_multi_gpu.py
Describe the feature and the current behavior/state
Currently the examples are not tested. We could add them to the test suite, or to a separate Travis stage, in order to check that every example still works after changes.
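A minimal pytest sketch of the first option (paths and timeout are assumptions):

import pathlib
import subprocess
import sys

import pytest

EXAMPLES = sorted(pathlib.Path("examples").rglob("*.py"))

@pytest.mark.parametrize("example", EXAMPLES, ids=str)
def test_example_runs(example: pathlib.Path) -> None:
    # smoke test: the example must exit cleanly within the time budget
    subprocess.run([sys.executable, str(example)], check=True, timeout=600)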
Will this change the current API? How?
We need to change the .travis.yml configuration.
Who will benefit from this feature?
Everyone.
AshPy supports conditional training.
See the Facades example.
This type of training should be documented better, maybe with notebooks :)
Scenario: define a training process with a metric used for model selection. Then best/file.json contains the stored values.
Example:
I trained a model, and I got this JSON with the values:
{
"AEAccuracy": "0.79296064",
"step": "8370",
"positive_threshold": "0.019780229777097702",
"positive_variance": "4.077112680533901e-05",
"negative_threshold": "0.0740804448723793",
"negative_variance": "4.077112680533901e-05"
}
After restarting the training, the file is overwritten and reset to the default values:
{
"AEAccuracy": "-inf",
"step": "0",
"positive_threshold": "0.0",
"positive_variance": "0.0",
"negative_threshold": "0.0",
"negative_variance": "0.0"
}
Thus, the model selection process starts again, and we are guaranteed to overwrite the previous best model.
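A sketch of a possible guard: read the existing JSON back instead of rewriting the defaults when the file is already there (the helper and its names are illustrative):

import json
from pathlib import Path

def load_or_init_best(json_path: Path, defaults: dict) -> dict:
    if json_path.exists():
        # resuming a training run: keep the previously stored best values
        return json.loads(json_path.read_text())
    json_path.write_text(json.dumps(defaults))
    return defaults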
System information
0.4.0
Describe the feature and the current behavior/state
Until now, we were stuck on Python 3.7 because TensorFlow did not support 3.8.
Python 3.8 should now be supported by TF, so AshPy can make the switch too.
Will this change the current API? How?
It should not create any breaking changes unless we start using 3.8-exclusive features.
Who will benefit from this feature?
All users?
Describe the bug
Run examples/gans/facades.py, stop the training after some time, and then try to restart the training. You get an error due to the fact that the discriminator expects a list of two tensors.
Expected behavior
The training should restart with models restored correctly.
Code to reproduce the issue
Simply run examples/gans/facades.py.
Other info / logs
Inside build_and_restore we should check the size of the discriminator inputs (and also the size of the generator inputs).
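A hedged sketch of such a check (the helper is illustrative; the real logic would live inside AshPy's restorer):

import tensorflow as tf

def check_num_inputs(model: tf.keras.Model, expected: int) -> None:
    # e.g. the facades discriminator takes [fake_or_real, condition],
    # so expected == 2
    if len(model.inputs) != expected:
        raise ValueError(
            f"{model.name} expects {expected} inputs, got {len(model.inputs)}"
        )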
Self-explanatory: ashpy.utils is missing from the API reference.
Describe the feature and the current behavior/state
TensorBoard images should be logged in the [0, 1] range when float.
Currently, we do not provide any type of automatic handling of this scaling, so if you are training on [-1, 1] and using our callbacks, images will be logged without scaling.
Will this change the current API? How?
Small change to ashpy.utils.log() and all the callbacks using it: they will now accept a new boolean argument auto_tb_scaling controlling this behavior.
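A minimal sketch of the proposed behavior; auto_tb_scaling is the argument suggested in this issue, not an existing AshPy parameter:

import tensorflow as tf

def log_images(name: str, images: tf.Tensor, step: int, auto_tb_scaling: bool = True) -> None:
    if auto_tb_scaling and images.dtype.is_floating:
        # map [-1, 1] -> [0, 1] so TensorBoard renders the images correctly
        images = (images + 1.0) / 2.0
    tf.summary.image(name, images, step=step)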
Who will benefit from this feature?
Anyone using the default logging facilities.
Maybe we can set up a dependency updater like https://github.com/pyupio/pyup.
The file src/ashpy/contexts/context.py
contains the following docstring:
r"""
Primitive Context Interface.
``Contexts`` are checkpointable (subclassed from :py:class:`tf.train.Checkpoint`)
collections of variable encapsulated in a Python Class as a way to seamlessly
handle information transfer.
"""
but the Context class inherits from object, not from tf.train.Checkpoint.
Also, there is no way to access the optimizer from the Context object.
Moreover, we initialize the global_step in the constructor (thus at declaration time, since tf.Variable objects are mutable) with tf.Variable(0, name="global_step", trainable=False, dtype=tf.int64). This must be avoided, since a context is always created from a well-defined context; therefore we can initialize it to None.
real_x, real_y = real_xy
if len(self._generator.inputs) == 2:
    g_inputs = [g_inputs, real_y]
with tf.GradientTape(persistent=True) as tape:
    fake = self._generator(g_inputs, training=True)
    logits_fake = self._discriminator(fake)  # why aren't this and the next line used below?
    logits_real = self._discriminator(real_y)
    d_loss = self._d_loss(
        self._context, fake=fake, real=real_x, condition=real_y, training=True
    )
    g_loss = self._g_loss(
        self._context, fake=fake, real=real_x, condition=real_y, training=True
    )
# check that we have some trainable_variables
assert self._generator.trainable_variables
assert self._discriminator.trainable_variables
# calculate the gradients
d_gradients = tape.gradient(d_loss, self._discriminator.trainable_variables)
g_gradients = tape.gradient(g_loss, self._generator.trainable_variables)