zurutech / ashpy
TensorFlow 2.0 library for distributed training, evaluation, model selection, and fast prototyping.
Home Page: https://ashpy.zurutech.io/
License: Apache License 2.0
Scenario:
The JSON file of the model selection is overwritten with default values, causing you to lose the previously (and correctly) stored values.
https://ashpy.zurutech.io/en/latest/_modules/ashpy/models/gans.html#Generator
The source code page should be made enlargeable in order to see the whole code.
The title says pretty much everything.
If I want to measure the performance of the model only at the end of every epoch, the ClassifierTrainer (for example) lets me set the logging frequency to a value <= 0.
In this way, I measure the performance at the end of every epoch on the validation set (it works).
However, on TensorBoard I only see the plot of the validation curves; the training curves aren't displayed anymore.
Describe the bug
I expect to be able to perform model selection using the classifier loss as the metric. Instead, I find only a JSON file with this content:
cat ~/log/on/best/loss/loss.json
{
"loss": "-inf",
"step": "0"
}
Expected behavior
I should find in the logdir, the best folder containing the JSON + the checkpoint files of the best model, wrt the chosen metric.
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
# (imports added for completeness; exact module paths may differ across AshPy versions)
import operator
from pathlib import Path

import tensorflow as tf

import ashpy
from ashpy.losses import ClassifierLoss
from ashpy.models.convolutional.autoencoders import Autoencoder
from ashpy.trainers import ClassifierTrainer


def get_model():
    """Create a new autoencoder tf.keras.Model."""
    autoencoder = Autoencoder(
        (64, 64),
        (4, 4),
        kernel_size=3,
        initial_filters=16,
        filters_cap=64,
        encoding_dimension=50,
        channels=3,
    )
    # encoding, representation = autoencoder(input)
    inputs = tf.keras.layers.Input(shape=(64, 64, 3))
    _, reconstruction = autoencoder(inputs)
    model = tf.keras.Model(inputs=inputs, outputs=reconstruction)
    return model


def _train(dataset: tf.data.Dataset, logdir: Path):
    reconstruction_error = ClassifierLoss(tf.keras.losses.MeanSquaredError())
    autoencoder = get_model()
    ClassifierTrainer(
        model=autoencoder,
        optimizer=tf.optimizers.Adam(1e-4),
        loss=reconstruction_error,
        metrics=[ashpy.metrics.ClassifierLoss(model_selection_operator=operator.lt)],
        logdir=str(logdir),
        epochs=100,
    )(dataset, dataset)
In short, the initialization of the JSON is wrong when I want to select a model whose metric should decrease.
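A minimal sketch of the fix, assuming the initial "best" value is derived from the model selection operator instead of being hardcoded to -inf (the helper name is illustrative, not AshPy API):

import operator

def initial_best_value(model_selection_operator) -> float:
    # operator.gt keeps increasing metrics -> start from -inf;
    # operator.lt keeps decreasing metrics (e.g. a loss) -> start from +inf.
    return float("-inf") if model_selection_operator is operator.gt else float("inf")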
https://ashpy.readthedocs.io/en/latest/_modules/ashpy/models/convolutional/unet.html#UNet
Make it point to https://arxiv.org/abs/1611.07004
See the following log:
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[960] loss: 0.04510442167520523
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[970] loss: 0.04019254446029663
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[980] loss: 0.03933567926287651
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[990] loss: 0.03875984624028206
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1000] loss: 0.03433336690068245
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1010] loss: 0.041549138724803925
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1020] loss: 0.040606431663036346
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
[1030] loss: 0.041963666677474976
AEAccuracy: validation value: 0.49551415 → 0.4955141544342041
As you can see, this is a constant metric that nevertheless triggers model selection at every validation, because we compare the saved value with the newly computed one, and the two are identical except for the number of decimal digits.
We should fix the number of digits we want to take into consideration.
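A hedged sketch of such a fixed-precision comparison, assuming we settle on 5 decimal digits (the helper below is illustrative, not AshPy API):

def is_new_best(new_value: float, best_value: float, op, ndigits: int = 5) -> bool:
    # 0.49551415 and 0.4955141544342041 both round to 0.49551, so the
    # constant AEAccuracy above would no longer trigger model selection.
    return op(round(new_value, ndigits), round(best_value, ndigits))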
The default value for the processing_predictions argument of ashpy.metrics.ClassifierMetric leads to some issues when kept at its default while working with metrics such as FBetaScore, Precision, and Recall.
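As a hedged workaround sketch, the default (which reportedly applies tf.argmax) can be overridden with a processing function suited to binary metrics; the {"fn": ..., "kwargs": ...} format mirrors the ClassifierMetric argument, but double-check the signature for your AshPy version:

import operator
import tensorflow as tf
from ashpy.metrics import ClassifierMetric

precision = ClassifierMetric(
    metric=tf.keras.metrics.Precision(),
    model_selection_operator=operator.gt,
    # round sigmoid outputs to {0, 1} instead of taking an argmax
    processing_predictions={"fn": tf.round, "kwargs": {}},
)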
StrategyBase.experimental_run_v2 (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
renamed to `run`
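The migration is mechanical; a small sketch of the rename (the strategy and step function are placeholders):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

@tf.function
def step(x):
    return x * 2  # placeholder per-replica computation

# Deprecated: strategy.experimental_run_v2(step, args=(tf.constant(1.0),))
# Replacement (TF >= 2.2):
result = strategy.run(step, args=(tf.constant(1.0),))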
Describe the bug
The title describes the bug.
Expected behavior
If I ask ashpy to use a logdir outside of the cwd, nothing in the cwd should change.
Instead, the JSON of the best model is currently created in the current working directory.
Code to reproduce the issue
# (imports added for completeness; exact module paths may differ across AshPy versions)
import operator
from pathlib import Path

import tensorflow as tf

import ashpy
from ashpy.losses import ClassifierLoss
from ashpy.models.convolutional.autoencoders import Autoencoder
from ashpy.trainers import ClassifierTrainer


def get_model():
    """Create a new autoencoder tf.keras.Model."""
    autoencoder = Autoencoder(
        (64, 64),
        (4, 4),
        kernel_size=3,
        initial_filters=16,
        filters_cap=64,
        encoding_dimension=50,
        channels=3,
    )
    # encoding, representation = autoencoder(input)
    inputs = tf.keras.layers.Input(shape=(64, 64, 3))
    _, reconstruction = autoencoder(inputs)
    model = tf.keras.Model(inputs=inputs, outputs=reconstruction)
    return model


def _train(dataset: tf.data.Dataset, logdir: Path):
    reconstruction_error = ClassifierLoss(tf.keras.losses.MeanSquaredError())
    autoencoder = get_model()
    ClassifierTrainer(
        model=autoencoder,
        optimizer=tf.optimizers.Adam(1e-4),
        loss=reconstruction_error,
        metrics=[ashpy.metrics.ClassifierLoss(model_selection_operator=operator.lt)],
        logdir=str(logdir),
        epochs=100,
    )(dataset, dataset)
The logdir can be defined both in trainers and in metrics.
The current behaviour is to override the metric's logdir with the trainer's logdir.
However, the logdir parameter of the trainer is optional and has a default value (cwd + "log").
logdir = "mylogdir"
precision = ClassifierMetric(
metric=tf.keras.metrics.Precision(),
model_selection_operator=operator.gt,
logdir=logdir,
)
trainer = ClassifierTrainer(
model=self.model,
optimizer=optimizer,
loss=loss,
epochs=epochs,
metrics=[precision],
callbacks=callbacks,
)
trainer(
self.train_dataset.batch(batch_size).prefetch(1),
self.validation_dataset.batch(batch_size).prefetch(1),
)
AshPy logs in the directory "log" instead of the directory "mylogdir".
Possible solution: remove logdir from the metric's __init__ and set the logdir from the trainer.
[Optional] remove the logdir default values.
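A minimal sketch of that solution (class and attribute names are illustrative, not the actual AshPy internals):

class Trainer:
    def __init__(self, metrics, logdir="log"):
        self._logdir = logdir
        self._metrics = metrics
        # the trainer becomes the single source of truth for the logdir
        for metric in self._metrics:
            metric.logdir = self._logdir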
If anyone finds a typo or any small error in the docs, please report it here. Thanks ❤️
Badges should link not to the image but to something relevant: e.g., clicking on the package: ashpy badge should take you to the ashpy PyPI page.
Describe the bug
No easy way to specify a name for a subclassed custom ClassifierMetric.
Expected behavior
While subclassing ClassifierMetric, it should be possible to specify a name for the metric.
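A hedged sketch of the desired API, where name= is the proposed addition rather than an existing parameter:

import operator
import tensorflow as tf
from ashpy.metrics import ClassifierMetric

class AEAccuracy(ClassifierMetric):
    def __init__(self):
        super().__init__(
            metric=tf.keras.metrics.BinaryAccuracy(),
            model_selection_operator=operator.gt,
            name="AEAccuracy",  # proposed parameter, not yet supported
        )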
Add, either to the docs or to a separate wiki, a list of projects (but also articles, tutorials, etc.) using our library.
https://ashpy.zurutech.io/en/latest/index.html
The logo can be added to the sidebar, like in the Keras docs: https://keras.io/
Right now the only way to correctly restore the content of the checkpoint of the best model is to re-create a trainer and, even if there is no need to train the model, re-instantiate everything in order to be able to restore it.
Moreover, if we don't specify certain elements of the trainer, we get warnings like:
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
Thus, a better way of restoring only the best model from the checkpoint (created during model selection) is needed.
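A lighter-weight alternative sketched with plain TensorFlow, assuming the best checkpoint lives under best_dir and only the model weights are needed (the checkpoint key model= may differ from what AshPy actually writes):

import tensorflow as tf

def restore_best(model: tf.keras.Model, best_dir: str) -> tf.keras.Model:
    checkpoint = tf.train.Checkpoint(model=model)
    status = checkpoint.restore(tf.train.latest_checkpoint(best_dir))
    # silence the warnings about optimizer/trainer-only variables
    status.expect_partial()
    return model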
Line 266 in f053532:
The loss is calculated as -tf.nn.relu(d_fake).
However, as in SA-GAN:
https://github.com/brain-research/self-attention-gan/blob/ad9612e60f6ba2b5ad3d3340ebae60f724636d75/model.py#L73
we should only minimize -d_fake.
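For reference, the SA-GAN generator hinge loss boils down to:

import tensorflow as tf

def generator_hinge_loss(d_fake: tf.Tensor) -> tf.Tensor:
    # no relu on the generator side: just minimize -E[D(G(z))]
    return -tf.reduce_mean(d_fake)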
https://ashpy.readthedocs.io/en/latest/_autosummary/models/ashpy.models.gans.html#
No documentation; the links to the various GAN components are broken.
Lines 56 to 59 in 7c44d35
Have these been solved? If not, maybe we should ping the TF team again.
System information
0.4.0
Describe the feature and the current behavior/state
Currently, we use a combination of print() and tf.print() to handle console output. Unifying them via standardized logging would increase clarity and give the end user control over the displayed messages via a logger, while letting developers have better control over debug, info, and warning messages.
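A minimal sketch, assuming we standardize on the stdlib logging module with a library-level logger (handler and format choices are illustrative):

import logging

logger = logging.getLogger("ashpy")

def log_step(step: int, loss: float) -> None:
    # replaces ad-hoc print()/tf.print() calls; end users can silence or
    # redirect these via logging.getLogger("ashpy").setLevel(...)
    logger.info("[%d] loss: %f", step, loss)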
Will this change the current API? How?
No change to it.
Who will benefit from this feature?
I think we can set up Travis to automate tests and deployment. Travis can be used for automatic deployment to PyPI in this way: https://docs.travis-ci.com/user/deployment/pypi/.
How to turn a keras.metrics.Metric into an AshPy Metric? The same question applies to Callbacks.
Describe the feature and the current behavior/state
Right now there are no "click and go" examples to run on Google Colab, and no notebooks that explain and show how to use AshPy and why it should be better or more useful than the pure Keras API or other solutions.
Will this change the current API? How?
No, it doesn't.
Who will benefit from this feature?
Everyone. Examples in the documentation and in the README are nice, but Colab notebooks are more powerful and easier to share (moreover, they are catchier: people love executing cells while learning).
The title says pretty much everything.
Creating a restorer (in particular a ClassifierRestorer, but that is not important) and passing as checkpoint directory the directory where the checkpoint of the best model has been saved causes an error.
Two options:
I like the first one.
Describe the bug
Run examples/gan/pix2pix_facades_multi_gpu.py in a multi-GPU scenario. If you try to restore the training once it has finished, you get an error due to wrong input shapes.
This is because in a multi-GPU scenario the batch size gets updated based on the number of devices.
Simply move the call to build_or_restore after the batch size update.
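Sketched in terms of the issue's own names (the exact code in the example may differ):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
batch_size_per_replica = 4  # illustrative value

# 1. Update the global batch size for the multi-GPU setting first.
batch_size = batch_size_per_replica * strategy.num_replicas_in_sync

# 2. Only then build or restore the models, so the restored input shapes
#    match the actual per-step batch:
# build_or_restore(models, checkpoint_dir)  # must come after the update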
Expected behavior
The restorer should restore the models.
Code to reproduce the issue
examples/gan/pix2pix_facades_multi_gpu.py
Describe the feature and the current behavior/state
Currently the examples are not tested. We could add them to the test suite, or to a separate Travis stage, in order to check that every example still works after changes.
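A minimal pytest sketch of the first option (paths and timeout are assumptions):

import pathlib
import subprocess
import sys

import pytest

EXAMPLES = sorted(pathlib.Path("examples").rglob("*.py"))

@pytest.mark.parametrize("example", EXAMPLES, ids=str)
def test_example_runs(example: pathlib.Path) -> None:
    # smoke test: the example must exit cleanly within the time budget
    subprocess.run([sys.executable, str(example)], check=True, timeout=600)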
Will this change the current API? How?
We need to change the .travis.yml configuration.
Who will benefit from this feature?
Everyone.
AshPy supports conditional training.
See the Facades example.
This type of training should be documented better, maybe with notebooks :)
Scenario: define a training process with a metric used for model selection. Then best/file.json contains the stored values.
Example:
I trained a model, and I got this JSON with the values:
{
"AEAccuracy": "0.79296064",
"step": "8370",
"positive_threshold": "0.019780229777097702",
"positive_variance": "4.077112680533901e-05",
"negative_threshold": "0.0740804448723793",
"negative_variance": "4.077112680533901e-05"
}
After restarting the training, the file is overwritten and reset to the default values:
{
"AEAccuracy": "-inf",
"step": "0",
"positive_threshold": "0.0",
"positive_variance": "0.0",
"negative_threshold": "0.0",
"negative_variance": "0.0"
}
Thus, the model selection process starts again, and we are guaranteed to overwrite the previous best model.
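A sketch of a possible guard: read the existing JSON back instead of rewriting the defaults when the file is already there (the helper and its names are illustrative):

import json
from pathlib import Path

def load_or_init_best(json_path: Path, defaults: dict) -> dict:
    if json_path.exists():
        # resuming a training run: keep the previously stored best values
        return json.loads(json_path.read_text())
    json_path.write_text(json.dumps(defaults))
    return defaults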
System information
0.4.0
Describe the feature and the current behavior/state
Until now, we were stuck on Python 3.7 because TensorFlow did not support 3.8.
Python 3.8 should now be supported by TF, so AshPy can make the switch too.
Will this change the current API? How?
It should not create any breaking changes unless we start using 3.8-exclusive features.
Who will benefit from this feature?
All users?
Describe the bug
Run examples/gans/facades.py, stop the training after some time, and then try to restart the training. You get an error due to the fact that the discriminator expects a list of two tensors.
Expected behavior
The training should restart with models restored correctly.
Code to reproduce the issue
Simply run examples/gans/facades.py.
Other info / logs
Inside build_and_restore we should check the size of the discriminator inputs (and also the size of the generator inputs).
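A hedged sketch of such a check (the helper is illustrative; the real logic would live inside AshPy's restorer):

import tensorflow as tf

def check_num_inputs(model: tf.keras.Model, expected: int) -> None:
    # e.g. the facades discriminator takes [fake_or_real, condition],
    # so expected == 2
    if len(model.inputs) != expected:
        raise ValueError(
            f"{model.name} expects {expected} inputs, got {len(model.inputs)}"
        )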
Self-explanatory: ashpy.utils is missing from the API reference.
Describe the feature and the current behavior/state
TensorBoard images should be logged in the [0, 1] range when float.
Currently, we do not provide any type of automatic handling of this scaling, so if you are training on [-1, 1] and using our callbacks, images will be logged without scaling.
Will this change the current API? How?
Small change to ashpy.utils.log() and all the callbacks using it: they will now accept a new boolean argument auto_tb_scaling controlling this behavior.
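A minimal sketch of the proposed behavior; auto_tb_scaling is the argument suggested in this issue, not an existing AshPy parameter:

import tensorflow as tf

def log_images(name: str, images: tf.Tensor, step: int, auto_tb_scaling: bool = True) -> None:
    if auto_tb_scaling and images.dtype.is_floating:
        # map [-1, 1] -> [0, 1] so TensorBoard renders the images correctly
        images = (images + 1.0) / 2.0
    tf.summary.image(name, images, step=step)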
Who will benefit from this feature?
Anyone using the default logging facilities.
Maybe we can set up a dependency updater like https://github.com/pyupio/pyup.
The file src/ashpy/contexts/context.py
contains the following docstring:
r"""
Primitive Context Interface.
``Contexts`` are checkpointable (subclassed from :py:class:`tf.train.Checkpoint`)
collections of variable encapsulated in a Python Class as a way to seamlessly
handle information transfer.
"""
but the Context class inherits from object, not from tf.train.Checkpoint.
Also, there is no way to access the optimizer from the Context object.
Moreover, we initialize the global_step in the constructor (thus at declaration time, since tf.Variable objects are mutable) with tf.Variable(0, name="global_step", trainable=False, dtype=tf.int64). This must be avoided, since a context is always created from a well-defined context; therefore we can initialize it to None.
real_x, real_y = real_xy
if len(self._generator.inputs) == 2:
    g_inputs = [g_inputs, real_y]
with tf.GradientTape(persistent=True) as tape:
    fake = self._generator(g_inputs, training=True)
    logits_fake = self._discriminator(fake)  # why aren't this and the next line used below?
    logits_real = self._discriminator(real_y)
    d_loss = self._d_loss(
        self._context, fake=fake, real=real_x, condition=real_y, training=True
    )
    g_loss = self._g_loss(
        self._context, fake=fake, real=real_x, condition=real_y, training=True
    )
# check that we have some trainable_variables
assert self._generator.trainable_variables
assert self._discriminator.trainable_variables
# calculate the gradients
d_gradients = tape.gradient(d_loss, self._discriminator.trainable_variables)
g_gradients = tape.gradient(g_loss, self._generator.trainable_variables)