
Comments (15)

jakeret commented on July 17, 2024

Hi @surfreta
Sorry for the late reply.

  1. Yes, the implementation uses a one-hot encoding for the classes. That means you need to provide the labels with shape rows x columns x classes.
  2. I use an unpadded (de)convolution, which reduces the size of the input. The 1000 is more or less arbitrary; it's just for convenience when computing the offset size.
  3. Using a batch size of 4 is also arbitrary; the resulting image just happened to have a convenient size to work with.

Hope it managed to reduce the confusion a bit
Best
jakeret
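
To make point 2 concrete, here is a minimal sketch of how unpadded ("valid") convolutions shrink an image edge. It assumes the layout of the original U-Net (two 3x3 convolutions per level, 2x2 max pooling on the way down, 2x up-convolution on the way up); the function name `unet_output_size` is hypothetical and not part of tf_unet.

```python
def unet_output_size(size, layers=3, filter_size=3):
    """Edge length of the prediction for a given input edge length.

    Assumes two unpadded convolutions per level, 2x2 pooling between
    levels on the way down and a 2x up-convolution on the way up.
    """
    shrink = 2 * (filter_size - 1)  # pixels lost per level (two valid convs)

    # Contracting path: convolve on every level, pool between levels.
    for level in range(layers):
        size -= shrink
        if level < layers - 1:
            if size % 2:
                raise ValueError("edge length is not divisible by 2 before pooling")
            size //= 2

    # Expanding path: upsample, then convolve again.
    for _ in range(layers - 1):
        size = size * 2 - shrink

    return size


input_size = 572
output_size = unet_output_size(input_size, layers=5)
offset = (input_size - output_size) // 2  # pixels to crop from each border
```

Under these assumptions a 572-pixel edge comes out as 388 pixels for a five-level network, which matches the sizes given in the original U-Net paper. The labels have to be cropped by `offset` pixels on each side to line up with the prediction.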

from tf_unet.

surfreta commented on July 17, 2024

Hi Joel,

Thank you so much for your reply!

I'm sorry to come back with some follow-up questions. Thank you again for helping me clear up these confusions.

1) I read through the training code, and it seems to me that the following code segment is not used at all. I have not been able to figure out where “avg_gradients” is actually used.

if avg_gradients is None:
    avg_gradients = [np.zeros_like(gradient) for gradient in gradients]
for i in range(len(gradients)):
    avg_gradients[i] = (avg_gradients[i] * (1.0 - (1.0 / (step+1)))) + (gradients[i] / (step+1))
norm_gradients = [np.linalg.norm(gradient) for gradient in avg_gradients]

My understanding is that batch optimization has already been handled in

_, loss, lr, gradients = sess.run(
    (self.optimizer, self.net.cost, self.learning_rate_node, self.net.gradients_node),
    feed_dict={self.net.x: batch_x,
               self.net.y: util.crop_to_shape(batch_y, pred_shape),
               self.net.keep_prob: dropout})

2) In the demo program, you set layers=3 and features_root=16. According to the paper, features_root=64 and layers=5. Is that right? I would just like to confirm.

3) Currently, I am trying to run this code against the Kaggle ultrasound segmentation set, which has about 5635 pairs of images. For training I set the batch size to 20, so each epoch has about 280 iterations.
The training process has just finished two epochs. I have several observations:

  1. At least at this early stage, the minibatch loss is very high and the accuracy is very low.
  2. I can see the minibatch loss decreasing steadily. However, the training accuracy and minibatch error are rather volatile. I do not quite understand how to explain this.
  3. I use the momentum optimizer as you did in the demo problem, but the learning rate stays the same, as shown in epoch 0 and epoch 1. According to the code, it should decrease after each epoch. I am really confused here.

Here are some screenshots. If you can share any insights or suggestions on this kind of result, I would greatly appreciate it. It looks to me as if the result shown here is much worse than the ones for the toy data set you posted. I understand that this may be because of the dataset itself. Could there be any other reasons, or ways to improve the result? Thank you very much.

`Iter 260, Minibatch Loss= 0.9573, Training Accuracy= 0.0064, Minibatch error= 99.4%
Iter 262, Minibatch Loss= 0.9571, Training Accuracy= 0.0026, Minibatch error= 99.7%
Iter 264, Minibatch Loss= 0.9568, Training Accuracy= 0.0323, Minibatch error= 96.8%
Iter 266, Minibatch Loss= 0.9565, Training Accuracy= 0.0264, Minibatch error= 97.4%
Iter 268, Minibatch Loss= 0.9563, Training Accuracy= 0.0355, Minibatch error= 96.5%
Iter 270, Minibatch Loss= 0.9560, Training Accuracy= 0.0048, Minibatch error= 99.5%
Iter 272, Minibatch Loss= 0.9557, Training Accuracy= 0.0096, Minibatch error= 99.0%
Iter 274, Minibatch Loss= 0.9555, Training Accuracy= 0.0072, Minibatch error= 99.3%
Iter 276, Minibatch Loss= 0.9552, Training Accuracy= 0.0162, Minibatch error= 98.4%
Iter 278, Minibatch Loss= 0.9549, Training Accuracy= 0.0217, Minibatch error= 97.8%
Epoch 0, Average loss: 0.0514, learning rate: 0.2000

name is epoch_0
Verification error= 96.1%, loss= 0.9593

the following ones are near the end of epoch 1

Iter 536, Minibatch Loss= 0.9227, Training Accuracy= 0.0053, Minibatch error= 99.5%
Iter 538, Minibatch Loss= 0.9225, Training Accuracy= 0.0030, Minibatch error= 99.7%
Iter 540, Minibatch Loss= 0.9222, Training Accuracy= 0.0064, Minibatch error= 99.4%
Iter 542, Minibatch Loss= 0.9220, Training Accuracy= 0.0026, Minibatch error= 99.7%
Iter 544, Minibatch Loss= 0.9218, Training Accuracy= 0.0323, Minibatch error= 96.8%
Iter 546, Minibatch Loss= 0.9215, Training Accuracy= 0.0264, Minibatch error= 97.4%
Iter 548, Minibatch Loss= 0.9213, Training Accuracy= 0.0355, Minibatch error= 96.5%
Iter 550, Minibatch Loss= 0.9211, Training Accuracy= 0.0048, Minibatch error= 99.5%
Iter 552, Minibatch Loss= 0.9208, Training Accuracy= 0.0096, Minibatch error= 99.0%
Iter 554, Minibatch Loss= 0.9206, Training Accuracy= 0.0072, Minibatch error= 99.3%
Iter 556, Minibatch Loss= 0.9204, Training Accuracy= 0.0162, Minibatch error= 98.4%
Iter 558, Minibatch Loss= 0.9201, Training Accuracy= 0.0217, Minibatch error= 97.8%
Epoch 1, Average loss: 0.0466, learning rate: 0.2000
name is epoch_1
Verification error= 96.1%, loss= 0.9245
`


jakeret commented on July 17, 2024

I'm glad if I can help you.

  1. You are correct, the optimization is handled in this line. The gradients are computed for debugging purposes, which might be helpful in your particular case. You can visualize them (among other summaries) with TensorBoard. If the gradients have "weird" values, it might indicate that something is not quite right with the net or maybe with the training data.

  2. In the original U-Net paper they used features_root=64, layers=5, correct. In my paper I used different network configs, and I got good results with a simpler network (32/3).

  3. I'm not sure if it's a good idea that the network is seeing all the training data in every epoch. In my case with the radio data, I kept the batch_size at the default value (1) and training_iters=32. Instead I use many more epochs (100). The latter is particularly important because the learning rate is exponentially decreased over the epochs.

Generally, after 1-2 epochs it might be too early to draw conclusions about the learning process (fluctuations within epochs are expected). Enable the summaries and use TensorBoard to check if everything is working as expected (learning rate and loss decreasing, accuracy increasing, activations are comparable among the convolution layers, gradients look sane, the image summaries look correct etc.). I would also try to start with a smaller network than the one in the original paper. In my case it was relatively hard to train that network (the regularizers helped a bit).
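
As a side note on point 1: the `avg_gradients` update quoted in the question is an incremental arithmetic mean of the gradients seen so far, whose norm can then be inspected. A minimal sketch with made-up gradient values (the helper `update_running_mean` and the numbers are hypothetical, not taken from tf_unet):

```python
import numpy as np


def update_running_mean(avg, new, step):
    """Same update rule as the avg_gradients line: mean over steps 0..step."""
    return avg * (1.0 - 1.0 / (step + 1)) + new / (step + 1)


# Made-up gradients of a single weight tensor over three training steps.
gradients = [np.array([1.0, -2.0]),
             np.array([3.0, 0.0]),
             np.array([5.0, 2.0])]

avg = np.zeros_like(gradients[0])
for step, gradient in enumerate(gradients):
    avg = update_running_mean(avg, gradient, step)

# The value that would be monitored: the norm of the averaged gradient.
gradient_norm = np.linalg.norm(avg)
```

After the loop `avg` equals `np.mean(gradients, axis=0)`, so a norm that collapses towards zero or blows up over many steps is a hint that the optimization is not behaving.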


surfreta commented on July 17, 2024

Hi joel,

Really appreciate your reply! Just one more question,

I use the momentum optimizer as you did in the demo problem, but the learning rate stays the same, as shown in epoch 0 and epoch 1. According to the code, it should decrease after each epoch. I am really confused here.
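
For reference, a minimal sketch of the expected behavior, assuming the learning rate follows a staircase exponential decay of the form `initial_rate * decay_rate ** (global_step // decay_steps)`. The function name is hypothetical, and the decay rate of 0.95 and the choice of one decay interval per epoch are illustrative assumptions, not values confirmed in this thread:

```python
def decayed_learning_rate(initial_rate, global_step, decay_steps, decay_rate,
                          staircase=True):
    """Exponentially decayed learning rate.

    With staircase=True the rate only drops once every decay_steps steps;
    otherwise it decays continuously.
    """
    if staircase:
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return initial_rate * decay_rate ** exponent


iters_per_epoch = 280  # roughly 5635 image pairs / batch size 20
rates = [decayed_learning_rate(0.2, epoch * iters_per_epoch,
                               iters_per_epoch, 0.95)
         for epoch in range(3)]

for epoch, rate in enumerate(rates):
    print("Epoch %d, learning rate: %.4f" % (epoch, rate))
```

If the decay were wired up like this, the printed rate would already differ at four decimal places after the first epoch. A rate that stays at 0.2000 would then suggest that the global step is not advancing or that the decay interval is much longer than one epoch.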


jakeret commented on July 17, 2024

Yeah me too ;-)
Do you see the same behavior when running the toy problem in the Jupyter notebook?


surfreta commented on July 17, 2024

Hi Joel,

When running the toy program, the learning rate keeps changing, as in the results you posted on the web.

But when running on the ultrasound set, the learning rate has stayed at 0.2 for the first four epochs, which have finished so far.

These are the epoch evaluation results for the first several epochs. The average loss keeps decreasing while the verification error stays the same. Does this suggest that the training is working in terms of the loss? But then why does the verification error not change at all? Moreover, the learning rate does not change.

`Verification error= 91.0%, loss= 4.1180
start optimization

Epoch 0, Average loss: 0.1933, learning rate: 0.2000
Verification error= 91.0%, loss= 3.6706

Epoch 1, Average loss: 0.1724, learning rate: 0.2000
Verification error= 91.0%, loss= 3.2816

Epoch 2, Average loss: 0.1543, learning rate: 0.2000
Verification error= 91.0%, loss= 2.9434

Epoch 3, Average loss: 0.1387, learning rate: 0.2000
Verification error= 91.0%, loss= 2.6494

Epoch 4, Average loss: 0.1250, learning rate: 0.2000
Verification error= 91.0%, loss= 2.3938`


jakeret commented on July 17, 2024

That the loss is decreasing is the most important thing. However, that the
learning rate is not changing is very weird, and I don't really have an
explanation for that. Are you using the momentum optimizer? Have you
adapted some of the optimizer settings?

After each epoch a prediction of the current network is written to disk in
the "prediction" folder. Do those look reasonable (from left to right:
input, ground truth, prediction)?

That the verification error remains constant is also a bit odd, but I
wouldn't worry about this too much at the moment.

surfreta commented on July 17, 2024

Hi Joel,

This is how I invoke "train" from launcher.py, which is the same as yours

net = unet.Unet(channels=1, n_class=2, layers=5, features_root=64)
trainer = unet.Trainer(net, optimizer="momentum", opt_kwargs=dict(momentum=0.2))

This is how I modify the train part in unet.py

x_train,mask_train= testimageinput.load_train_data()

# x_train and mask_train is used to store the tensor for both raw image and mask image. x_train is of 5635*row*column*1, mask_train is of 5635*row*column*1

with tf.Session() as sess:
    sess.run(init)
    if restore:
        ckpt = tf.train.get_checkpoint_state(output_path)
        if ckpt and ckpt.model_checkpoint_path:
            self.net.restore(sess, ckpt.model_checkpoint_path)

    test_x = x_train[0,:,:,:]
    test_x = test_x[np.newaxis]
    test_y = mask_train[0,:,:,:]
    test_y = test_y[np.newaxis]

    pred_shape = self.store_prediction(sess, test_x, test_y, "_init")

    summary_writer = tf.train.SummaryWriter(output_path, graph=sess.graph)
    print("Start optimization")

    for epoch in range(epochs):
        total_loss = 0
        ij = 0

        new_batch_size = 20
        new_training_iters = int(5635/new_batch_size)-1  # there are 5635 pairs of training data, "-1" just make sure the program does not go beyond the range

        for step in range((epoch*new_training_iters), ((epoch+1)*new_training_iters)):
            batch_x = x_train[ij:ij+20,:,:,:]
            batch_y = mask_train[ij:ij+20,:,:,:]

            # Run optimization op (backprop)
            _,loss,lr,*gradients = sess.run(
                [self.optimizer, self.net.cost, self.learning_rate_node]+self.net.gradients_node,
                feed_dict={self.net.x: batch_x,
                           self.net.y: util.crop_to_shape(batch_y, pred_shape),
                           self.net.keep_prob: dropout})

            if step % display_step == 0:
                self.output_minibatch_stats(sess, summary_writer, step, batch_x, util.crop_to_shape(batch_y, pred_shape))

            total_loss += loss
            ij=ij+20

        self.output_epoch_stats(epoch, total_loss, training_iters, lr)
        self.store_prediction(sess, test_x, test_y, "epoch_%s"%epoch)

        save_path = self.net.save(sess, save_path)


surfreta commented on July 17, 2024

Hi Joel, since I could not post figures here, I uploaded the epoch figures to Dropbox and have sent you a link.

It looks like all the predicted figures (the rightmost ones) are black in the first several epochs. Does this indicate that the model is far from converging after the first several epochs, even though the loss keeps decreasing?


jakeret commented on July 17, 2024

I'm looking at your code on my mobile, so maybe I'm missing something.
I don't really get why you had to reimplement the method of the Trainer
class. The implementation is so generic that you only have to write your
'data_provider' for your specific data set and everything else remains the
same.
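
A data provider along these lines might look like the sketch below. It assumes the trainer calls the provider with the batch size and expects an `(images, labels)` pair back, with the labels already one-hot encoded; the exact interface should be checked against the tf_unet source. The class name `ArrayDataProvider` and its attributes are hypothetical.

```python
import numpy as np


class ArrayDataProvider:
    """Hypothetical provider serving consecutive batches from in-memory arrays."""

    def __init__(self, images, labels):
        assert len(images) == len(labels)
        self.images = images              # shape: (n, rows, cols, channels)
        self.labels = labels              # shape: (n, rows, cols, n_class)
        self.channels = images.shape[-1]
        self.n_class = labels.shape[-1]
        self._cursor = 0

    def __call__(self, n):
        # Wrap around at the end of the data so no batch is ever short.
        idx = (self._cursor + np.arange(n)) % len(self.images)
        self._cursor = (self._cursor + n) % len(self.images)
        return self.images[idx], self.labels[idx]


# Small stand-in arrays; the real data would be e.g. 5635 x 420 x 580 x 1.
images = np.random.rand(10, 8, 8, 1).astype(np.float32)
labels = np.zeros((10, 8, 8, 2), dtype=np.float32)

provider = ArrayDataProvider(images, labels)
batch_x, batch_y = provider(4)
```

Keeping the batching logic in the provider means the training loop itself does not need to be modified, which also rules out the loop changes as a source of the problem.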
On Nov 7, 2016 1:32 AM, "surfreta" [email protected] wrote:

Hi Joel, since I could not post figure here. I uploaded the epoch figures
on the dropbox, and have sent you a link. It kind of confuses me since they
all look the same, I mean between different epochs.



jakeret commented on July 17, 2024

The black prediction indicates a problem (it could be that the predicted
probabilities are the same across the entire image). Do you get the same
result with a simple net, e.g. 3 layers?

surfreta commented on July 17, 2024

Hi Joel,

Yes, you are right, I will refactor my code to make it more readable. The only part I changed is the segment that reads the batch data.

At the very beginning I used the smaller network and got the same result.

Right now, one of my concerns is about converting the ultrasound data set to the format your program expects. For the ultrasound data set, both the raw image and the mask image have shape (420,580,1). For each mask image, I compute an extra channel as mask_image_background = 1 - mask_image. Then I concatenate this with mask_image so that the labels have shape (420,580,2), which can be used directly by your program. Is this the correct approach? Thanks.
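
The conversion described above can be sketched as follows. It assumes channel 0 is the background and channel 1 the foreground, and that the mask may be stored as a 0/255 image (in which case a plain `1 - mask_image` would not give a valid background channel). The function name `to_one_hot` is hypothetical.

```python
import numpy as np


def to_one_hot(mask):
    """Turn a (rows, cols, 1) binary mask into (rows, cols, 2) one-hot labels."""
    foreground = (mask > 0).astype(np.float32)  # works for 0/1 and 0/255 masks
    background = 1.0 - foreground
    return np.concatenate([background, foreground], axis=-1)


# Stand-in mask with the shape of the ultrasound data and one marked region.
mask = np.zeros((420, 580, 1), dtype=np.uint8)
mask[100:200, 150:300, 0] = 255

labels = to_one_hot(mask)
```

Every pixel of `labels` then has exactly one channel set to 1, which is what a one-hot encoding of two classes requires.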


jakeret commented on July 17, 2024

I was seeing a similar behavior when Tensorflow was failing to optimize the parameters. In my case I started with a simple network to find the issue.
What also helped was to normalize the data e.g. (from scripts/radio_util.py):

def _load_data_and_label(self):
    data, rfi = self._next_chunck()
    nx = data.shape[1]
    ny = data.shape[0]

    #normalization
    data = np.clip(np.fabs(data), self.a_min, self.a_max)
    train_data = data.reshape(1, ny, nx, 1)
    train_data -= np.amin(data)
    train_data /= np.amax(data)
    labels = np.zeros((1, ny, nx, 2), dtype=np.float32)
    labels[..., 1] = rfi
    labels[..., 0] = ~rfi
    return train_data, labels


wenouyang commented on July 17, 2024

Hi Joel,

Thanks for the input.

With respect to debugging with the simple network, could you elaborate a bit more? What could the issues be? I am rather short of clues; it looks like people generally reduce the batch size or change the optimization algorithm. Can you share more insight into how you went about finding the issue?

Regarding normalizing the data, what does your code labels[..., 1] = rfi, labels[..., 0] = ~rfi aim to do? Does it achieve the same thing as what I did with mask_image_background = 1 - mask_image?
In particular, what does labels[..., 0] = ~rfi do? I am not very clear about the usage of ~rfi. Thanks.


jakeret commented on July 17, 2024

I reduced the size of the unet to 2 or 3 layers with a filter size of 16. You can also try to reduce the size of the input image a bit.

The momentum optimizer is generally not a bad choice for debugging due to its simplicity. Possibly you can increase the momentum for a better parameter space exploration (e.g. to 0.9).

Another thing to do is to make sure that the training data is of good quality (similar amounts of pixels of the different classes, no pixels with big outlier values etc.) and of course that it has the correct format. See my radio data example.
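
One quick way to check the class balance is to count the pixels per class in the one-hot labels. A minimal sketch, assuming labels shaped (..., n_class); the function name `class_fractions` and the toy labels are hypothetical:

```python
import numpy as np


def class_fractions(labels):
    """Fraction of pixels belonging to each class in one-hot labels."""
    counts = labels.reshape(-1, labels.shape[-1]).sum(axis=0)
    return counts / counts.sum()


# Toy labels: the upper half of every image is foreground (class 1),
# the lower half is background (class 0).
labels = np.zeros((4, 6, 6, 2), dtype=np.float32)
labels[:, :3, :, 1] = 1.0
labels[:, 3:, :, 0] = 1.0

fractions = class_fractions(labels)
```

A strongly skewed result, for example a few percent of foreground pixels, would be a sign that the network can reach a low loss by predicting background everywhere.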

As for the code:
rfi is a binary array where true values mean that a pixel contains RFI and false values mean that it's background.
labels is the array with the one-hot encoding of the two class labels.

labels[..., 1] contains the same information as rfi (true if contaminated).
labels[..., 0] on the other hand is true if a pixel is just background (~ is the bitwise invert operator in Python, which acts as a logical NOT on boolean arrays).
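
A small sketch of the difference, using made-up arrays: on a boolean array `~` flips True and False, whereas on an integer 0/1 mask it is a bitwise NOT and does not give the background, so `1 - mask` is the safer choice for numeric masks.

```python
import numpy as np

# Boolean mask, as in the radio example: True marks a contaminated pixel.
rfi = np.array([[True, False],
                [False, True]])

labels = np.zeros(rfi.shape + (2,), dtype=np.float32)
labels[..., 1] = rfi    # foreground channel
labels[..., 0] = ~rfi   # background channel (logical NOT on booleans)

# Integer mask: ~ is a bitwise NOT here and yields negative numbers.
int_mask = np.array([0, 1])
inverted_bitwise = ~int_mask       # not a valid background channel
inverted_numeric = 1 - int_mask    # background for a 0/1 mask
```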
