Training loss about mxnet_pose_for_ai_challenger HOT 19 OPEN

dragonfly90 commented on August 24, 2024

Training loss

from mxnet_pose_for_ai_challenger.

Comments (19)

qqsh0214 commented on August 24, 2024

@dragonfly90 I found that my earlier "simple bind error" was caused by using the CPU version of mxnet, which I hadn't realized before. So I have just restarted training from the beginning. I will try a larger batch_size.
By the way, I'm trying Mask R-CNN by He Kaiming for pose estimation. If you are interested in it, we can talk later.

dragonfly90 commented on August 24, 2024

@qqsh0214 Maybe you could try batch_size = 10. It seems to work well for the COCO dataset. I am interested in Mask R-CNN, too. Which version of Mask R-CNN do you want to try? I am trying to implement the feature pyramid head of Mask R-CNN in mxnet.

qqsh0214 commented on August 24, 2024

@dragonfly90 OK, I will try batch_size=10. I am referring to section 5 of https://arxiv.org/abs/1703.06870. I am trying to implement Mask R-CNN for pose estimation based on the Faster R-CNN implementation in mxnet: https://github.com/precedenceguo/mx-rcnn

dragonfly90 commented on August 24, 2024

@qqsh0214 Cool! Then we may work together. How is it going? I could work on the feature pyramid head first. I have written some code for it. You could work on the mask and ROIAlign first if you would like to.

qqsh0214 commented on August 24, 2024

@dragonfly90 I have worked on the mask. I will try to build ROIAlign on top of the C++ source code of ROIPooling in mxnet. But I am a little confused about how to change the data IO code for training on pose. Do you have any ideas?

dragonfly90 commented on August 24, 2024

@qqsh0214 I think we could first use the ground-truth bounding boxes to train the region proposal network, then get the human mask, and then do keypoint regression within the mask. But I am not sure this is right. We may need to code and debug a lot. Did you get any results using CPM? I am short of GPUs now; our server is occupied by other tasks.

qqsh0214 commented on August 24, 2024

@dragonfly90 I don't have results because I got this error:
ValueError: Too many slices. Some splits are empty.
The training terminated at around 4500 iterations. We have GPUs available.

dragonfly90 commented on August 24, 2024

@qqsh0214 I don't know what kind of error this is. It may be a code issue. Could you figure out which image causes it? I am using someone else's computer to train on the validation dataset because it is smaller than the training dataset (200k images, if I am right). Are you using the training dataset? Maybe we could talk tomorrow?

qqsh0214 commented on August 24, 2024

@dragonfly90 It trained for a day and terminated this morning. I don't know which image causes this error. I trained on the training dataset, which has 210K images. We can talk tomorrow.

dragonfly90 commented on August 24, 2024

@qqsh0214 Did you fix the bug? My training has produced some results. It seems to work well on the neck, but it cannot distinguish the left and right shoulders or other symmetric parts.

qqsh0214 commented on August 24, 2024

@dragonfly90 I have not fixed the bug, but I think it may be related to multi-GPU training, with one of the GPUs not working. I'm sorry, I am going back home for the National Day holiday, so my work will stop for a few days.

dragonfly90 commented on August 24, 2024

@qqsh0214 No problem. Have a good holiday! I will think about the problem.

qqsh0214 commented on August 24, 2024

@dragonfly90 I updated the evaluation code. You can have a look.
https://github.com/PoseAIChallenger/mxnet_pose_for_AI_challenger/blob/master/evaluate.py

dragonfly90 commented on August 24, 2024

@qqsh0214 Cool, thank you.

neilyoyoyoyo commented on August 24, 2024

@qqsh0214 @dragonfly90 I hit the same error while training on the 210k training images. Here is the training log:
iteration: 4518
start heat: 28.3897827148
start paf: 119.586230469
end heat: 28.4323547363
end paf: 119.595092773
Traceback (most recent call last):
File "TrainWeight.py", line 222, in
cmodel.fit(aidata, num_epoch = iteration, batch_size = batch_size, carg_params = newargs)
File "TrainWeight.py", line 120, in fit
cmodel.forward(data_batch, is_train=True) # compute predictions
File "/usr/local/lib/python2.7/dist-packages/mxnet/module/module.py", line 594, in forward
self.reshape(new_dshape, new_lshape)
File "/usr/local/lib/python2.7/dist-packages/mxnet/module/module.py", line 459, in reshape
self._exec_group.reshape(self._data_shapes, self._label_shapes)
File "/usr/local/lib/python2.7/dist-packages/mxnet/module/executor_group.py", line 348, in reshape
self.bind_exec(data_shapes, label_shapes, reshape=True)
File "/usr/local/lib/python2.7/dist-packages/mxnet/module/executor_group.py", line 310, in bind_exec
self.data_layouts = self.decide_slices(data_shapes)
File "/usr/local/lib/python2.7/dist-packages/mxnet/module/executor_group.py", line 255, in decide_slices
self.slices = _split_input_slice(self.batch_size, self.workload)
File "/usr/local/lib/python2.7/dist-packages/mxnet/executor_manager.py", line 64, in _split_input_slice
raise ValueError('Too many slices. Some splits are empty.')
ValueError: Too many slices. Some splits are empty.
Any suggestions about this bug? Thanks.
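
The traceback suggests the error is raised while a batch is being divided among devices, when at least one device would receive no samples. The sketch below is a simplified illustration of that kind of check. It is not MXNet's implementation of `_split_input_slice`, and the function name `split_batch_across_devices` and the even-split rule are assumptions made for the example.

```python
def split_batch_across_devices(batch_size, num_devices):
    """Split sample indices [0, batch_size) into one contiguous slice per device.

    Simplified illustration only; not MXNet's actual implementation.
    """
    base, extra = divmod(batch_size, num_devices)
    slices = []
    begin = 0
    for device in range(num_devices):
        # The first `extra` devices take one additional sample.
        end = begin + base + (1 if device < extra else 0)
        if begin >= end:
            # This device would receive no samples at all.
            raise ValueError("Too many slices. Some splits are empty.")
        slices.append(slice(begin, end))
        begin = end
    return slices


# A full batch of 10 samples on 2 devices splits cleanly.
print(split_batch_across_devices(10, 2))

# A batch with fewer samples than devices (or no samples) cannot be split.
for short_batch in (1, 0):
    try:
        split_batch_across_devices(short_batch, 2)
    except ValueError as err:
        print("batch of", short_batch, "->", err)
```

Under this reading, the error would point to a batch that came out of the data iterator with too few samples, rather than to a particular bad image.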

qqsh0214 commented on August 24, 2024

@neilyoyoyoyo One solution is to divide the training data into 5 parts, each with fewer than 45000 images. You also need to create the 5 corresponding data.json files.
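
A minimal sketch of such a split, assuming data.json holds a single JSON object keyed by image ID (the exact layout of the file is not shown in this thread). The helper name `split_annotations` and the output file names are hypothetical.

```python
import json


def split_annotations(data, max_per_part):
    """Split a dict of annotations into dicts of at most max_per_part entries."""
    keys = sorted(data)  # fixed order so the split is reproducible
    return [
        {key: data[key] for key in keys[start:start + max_per_part]}
        for start in range(0, len(keys), max_per_part)
    ]


def write_parts(parts, prefix="data_part"):
    """Write each part to its own JSON file and return the file names."""
    paths = []
    for index, part in enumerate(parts):
        path = "{}_{}.json".format(prefix, index)
        with open(path, "w") as handle:
            json.dump(part, handle)
        paths.append(path)
    return paths


# Usage (file names are placeholders):
# with open("data.json") as handle:
#     data = json.load(handle)
# write_parts(split_annotations(data, 45000))
```

With about 210K images and at most 45000 per part, this produces 5 files, and training can then be run on each part in turn.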

neilyoyoyoyo commented on August 24, 2024

@qqsh0214 well, it's useful, thank you

neilyoyoyoyo commented on August 24, 2024

@dragonfly90 I may have found why the above bug happens. In "TrainWeight.py", you set a break in the "next" function of the class "AIChallengerIterweightBatch":
```python
def next(self):
    if self.cur_batch < self.num_batches:

        transposeImage_batch = []
        heatmap_batch = []
        pagmap_batch = []
        heatweight_batch = []
        vecweight_batch = []

        for i in range(batch_size):
            if self.cur_batch >= 45174:
                break
            image, mask, heatmap, pagmap = getImageandLabel(self.data[self.keys[self.cur_batch]])
            maskscale = mask[0:368:8, 0:368:8, 0]
            heatweight = np.ones((numofparts, 46, 46))
            vecweight = np.ones((numoflinks*2, 46, 46))

            for i in range(numofparts):
                heatweight[i,:,:] = maskscale

            for i in range(numoflinks*2):
                vecweight[i,:,:] = maskscale

            transposeImage = np.transpose(np.float32(image), (2,0,1))/256 - 0.5

            self.cur_batch += 1

            transposeImage_batch.append(transposeImage)
            heatmap_batch.append(heatmap)
            pagmap_batch.append(pagmap)
            heatweight_batch.append(heatweight)
            vecweight_batch.append(vecweight)

        return DataBatchweight(mx.nd.array(transposeImage_batch),
                               mx.nd.array(heatmap_batch),
                               mx.nd.array(pagmap_batch),
                               mx.nd.array(heatweight_batch),
                               mx.nd.array(vecweight_batch))
    else:
        raise StopIteration
```
I don't know what "45174" means. If possible, could you fix it?

dragonfly90 commented on August 24, 2024

@neilyoyoyoyo Thank you very much! I made a mistake there. I reused the code I wrote for Microsoft COCO, where the number of training images is 45174.
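
With the hard-coded limit, once `self.cur_batch` reaches 45174 the inner loop breaks immediately and the returned batch is short or empty. That is consistent with the failure near iteration 4518 if the batch size was 10, which is an assumption, since the batch size of the failing runs is not stated. One possible fix is to bound the loop by the number of loaded samples instead of a constant. The sketch below shows only the indexing logic: `BatchIndexIterator` and its `drop_last` option are hypothetical names and not part of TrainWeight.py.

```python
class BatchIndexIterator(object):
    """Yield lists of sample keys, bounded by the dataset size (illustration only)."""

    def __init__(self, keys, batch_size, drop_last=False):
        self.keys = list(keys)
        self.batch_size = batch_size
        self.drop_last = drop_last  # skip a final batch smaller than batch_size
        self.cursor = 0             # index of the next unread sample

    def __iter__(self):
        return self

    def __next__(self):
        remaining = len(self.keys) - self.cursor
        if remaining <= 0 or (self.drop_last and remaining < self.batch_size):
            raise StopIteration
        end = self.cursor + min(self.batch_size, remaining)
        batch = self.keys[self.cursor:end]
        self.cursor = end
        return batch

    next = __next__  # Python 2 spelling, as used in the thread


# 25 samples with batch_size 10 give batches of 10, 10 and 5 and never an empty one.
for batch in BatchIndexIterator(range(25), batch_size=10):
    print(len(batch))
```

Setting `drop_last=True` skips the final partial batch, which may be preferable when that batch could hold fewer samples than there are GPUs.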
