pytorch-yolov3's People

Contributors

0asa, alexandrefassio, cahity, catalinolaru1, ctrlxx, dependabot[bot], developer0hye, eriklindernoren, everwinter23, flova, hnishi, id9502, jaagut, johagge, joshuachough, timforby, triwahyuu, v-iashin, wptoux, yrrah

pytorch-yolov3's Issues

Only one loss found

Hi,
I can only find one loss, computed at a single scale. Could you point out where the other two losses (for the other scales) are computed?

no save_weights function

Hi, I didn't find a save_weights function in this code. How can I save the parameters and the test results? Thanks.

Some problems with saving weights

I've been following this repo for about a week and got it working so I could start training on my own data. But when I tried to save weights, I got an error:

   AttributeError: 'Darknet' object has no attribute 'seen' 

I checked models.py. Unfortunately, self.seen and self.header_info are missing attributes in the Darknet class. I looked up some references and declared them as:

self.seen = 0
self.header_info = torch.IntTensor([0,0,0,0,0])

added two lines in save_weights:

    header = self.header_info
    header = header.numpy()

and changed self.header_info.tofile(fp) to header.tofile(fp). After all these modifications, weights can be saved and loaded. But when I load them in detect.py, no detections are made.

I wonder whether this is caused by my modification of seen and header_info or simply by too few epochs. Any suggestions would be appreciated. Thanks for your time.
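
A consolidated sketch of the patch described above, using a stand-in Darknet class rather than the repo's full implementation (only the added attributes and the header-writing path are shown):

    import torch
    import torch.nn as nn

    class Darknet(nn.Module):  # minimal stand-in for the repo's Darknet
        def __init__(self):
            super(Darknet, self).__init__()
            self.seen = 0                                         # images seen during training
            self.header_info = torch.IntTensor([0, 0, 0, 0, 0])   # darknet header: five int32 values

        def save_weights(self, path):
            with open(path, "wb") as fp:
                self.header_info[3] = self.seen        # store the image counter in the header
                header = self.header_info.numpy()
                header.tofile(fp)
                # ... the convolutional / batch-norm parameters are written after the header ...

    Darknet().save_weights("checkpoint.weights")       # writes the 20-byte header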

mAP

The computation of mAP in test.py is actually precision.

About Loss

What training loss values do you get? In my implementation of the YOLOv3 loss, the loss is very large and hard to converge, because the number of grid cells not containing an object is so large. Did you change anything compared to the original implementation?

training with my own data

Hi, I changed utils/datasets.py and some config files,
and made a mini training dataset of 8 images with labels.
But when I train on it, the loss is very large (total loss 1000+), so I can't get correct detection results.
So, is there possibly a bug in the training code?

error when testing

Namespace(batch_size=1, class_path='data/coco.names', conf_thres=0.8, config_path='config/yolov3.cfg', image_folder='data/samples', img_size=416, n_cpu=8, nms_thres=0.4, weights_path='weights/yolov3.weights')
Traceback (most recent call last):
File "detect.py", line 41, in
model.load_weights(opt.weights_path)
File "/mnt/disk_4T_1/Project/Detection/PyTorch-YOLOv3/old_source_code/PyTorch-YOLOv3/models.py", line 250, in load_weights
bn_layer.running_mean.data.copy_(bn_rm)
File "/home/qian/anaconda3/envs/py35torch/lib/python3.5/site-packages/torch/tensor.py", line 407, in data
raise RuntimeError('cannot call .data on a torch.Tensor: did you intend to use autograd.Variable?')
RuntimeError: cannot call .data on a torch.Tensor: did you intend to use autograd.Variable?
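
A hedged workaround, assuming the cause is calling .data on the batch-norm buffers (running_mean and running_var are plain tensors, not Variables, in this PyTorch version): copy into the buffers directly. bn_rm below is a stand-in for the values read from the weights file.

    import torch
    import torch.nn as nn

    bn_layer = nn.BatchNorm2d(8)
    bn_rm = torch.zeros(8)               # stand-in for the running mean read from yolov3.weights
    bn_layer.running_mean.copy_(bn_rm)   # no .data: the buffer is already a plain tensor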

Question about build_targets

In models.py build_targets is called by passing dim=g_dim (aka the input size) and anchors=scaled_anchors (aka anchors scaled down by stride) so that here in utils.py the IoU is computed between the groundtruth box scaled by dim and scaled_anchors (both zero centered).

This doesn't look right to me.
Shouldn't either the gt boxes be scaled by dim / stride instead, or the anchors be left unscaled?

What is the real mAP of this implementation ?

I think the mAP you mention in the project README is the one computed with your old evaluation code, which wasn't a real mAP computation. I know there has been a new implementation recently, and I used it to compute a mAP for my own version of YOLOv3. The value I get at a 0.5 IoU threshold (VOC-style mAP) is 0.65 with your mAP implementation. I also wrote a mAP evaluation module of my own, and it gives me 0.41 VOC mAP. All these evaluations are done with the official weights on the COCO dataset with 416x416 inputs.

The official score is 0.55, so my evaluation might be wrong, or my darknet implementation might be missing things or have small mistakes. The fact that your evaluation gets such a high score tells me it is almost certainly wrong, though. The last option is that the author himself made a mistake, which I would consider unlikely. I will review the code soon and update the issue if I find mistakes, but I thought I would mention it now since your mAP numbers are already in the README.

Another, completely different, possibility is that your evaluation module is correct and the high score is explained by the fact that I use the 2014 validation set and the author trained the official weights on it. This might be the most likely case, because I just noticed that the script he uses to download the COCO dataset creates a validation subset of 5k images.

I will consider the last possibility as the correct one and I'll reopen the issue if I notice errors in the evaluation module.

RuntimeError: unique is currently CPU-only, and lacks CUDA support. Pull requests welcome!

Traceback (most recent call last):
File "test.py", line 68, in
detections = non_max_suppression(detections, 80, opt.conf_thres, opt.nms_thres)
File "/PyTorch-YOLOv3/utils.py", line 73, in non_max_suppression
for c in detections[:, -1].unique():
File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 310, in unique
sorted=sorted, return_inverse=return_inverse)
RuntimeError: unique is currently CPU-only and lacks CUDA support. Pull requests welcome!

It fails as shown above; it seems unique() is CPU-only?
My environment is:
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
torch.__version__
'0.4.0'
4 Titan X GPUs
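
A hedged workaround for torch 0.4.0, where unique() has no CUDA kernel: move the class column to the CPU before taking unique values. The tensor below is only an illustrative stand-in for the NMS input.

    import torch

    detections = torch.rand(10, 7)                  # stand-in for the NMS input
    if torch.cuda.is_available():
        detections = detections.cuda()
    for c in detections[:, -1].cpu().unique():      # unique on the CPU copy
        cls_dets = detections[detections[:, -1] == c.item()]  # filtering stays on the GPU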

Insufficient GPU memory

Hello,
my GPU is a 1080 Ti, but I run out of GPU memory during training. Should something like darknet's subdivisions be used to split each batch?
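
One way to emulate darknet's subdivisions in PyTorch is gradient accumulation: run several small forward/backward passes and take a single optimizer step. A self-contained toy sketch (the linear model, data, and optimizer are stand-ins, not the repo's code):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                                   # stand-in for Darknet
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]

    subdivisions = 4                      # split each "full" batch into 4 smaller ones
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        loss = nn.functional.mse_loss(model(x), y)
        (loss / subdivisions).backward()  # accumulate scaled gradients
        if (i + 1) % subdivisions == 0:
            optimizer.step()              # one step per accumulated "full" batch
            optimizer.zero_grad()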

save weights error

Traceback (most recent call last):
File "train.py", line 108, in
model.save_weights('%s/%d.weights' % (opt.checkpoint_dir, epoch))
File "F:\cmp\VehicleDetection\PyTorch-YOLOv3\models.py", line 306, in save_weights
self.header_info[3] = self.seen
File "D:\software\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 532, in getattr
type(self).name, name))
AttributeError: 'Darknet' object has no attribute 'seen'

torch.cuda.is_available bug

test.py and train.py always end up with cuda = True, because the boolean reflects the presence of the function object rather than CUDA availability (is_available is never called). The conditional expression is also redundant. Suggest replacing
cuda = True if torch.cuda.is_available else False

with
cuda = torch.cuda.is_available()

Incorrect bounding box coordinates?

Hi, in utils.py, when transforming box2 from (center x, center y, width, height) to corner coordinates, you do the following operation:
b2_x1, b2_x2 = box2[:, 0] - box1[:, 2] / 2, box2[:, 0] + box1[:, 2] / 2
b2_y1, b2_y2 = box2[:, 1] - box1[:, 3] / 2, box2[:, 1] + box1[:, 3] / 2
but I think it should be:
b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
since the width and height should be taken from the second box. Am I right, or am I missing something? Thanks in advance!

Error when training coco @batch 1473, epoch 0

Epoch 0/1000, Batch 1473/7329 | Losses: x 0.126482, y 0.129091, w 0.692879, h 0.779148, conf 0.268520, cls 1.789521, total 3.785640
Traceback (most recent call last):
File "/mnt/diskb/even/yolov3_pytorch/train.py", line 97, in
for batch_i, (_, imgs, targets) in enumerate(dataloader):
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 264, in next
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 264, in
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/mnt/diskb/even/yolov3_pytorch/utils/datasets.py", line 73, in getitem
h, w, _ = img.shape
ValueError: not enough values to unpack (expected 3, got 0)
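
A hedged debugging sketch for the ValueError above: "got 0" suggests the loader returned a 0-dimensional array, which usually means an image path in the train list is wrong or the file is unreadable. The list path and the PIL-based check are assumptions, not the repo's code.

    import numpy as np
    from PIL import Image

    with open("data/coco/trainvalno5k.txt") as f:   # example train-list path
        for line in f:
            path = line.strip()
            try:
                img = np.array(Image.open(path))
            except OSError as exc:
                print("unreadable:", path, exc)
                continue
            if img.ndim != 3:
                print("suspicious shape:", path, img.shape)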

about trained weights

I followed the instructions and trained a model on the COCO dataset provided by the author. After 20 epochs, the total loss is about 0.2-0.3. When I use the newly trained weights to detect on the sample images, no detections are made. Has anybody run into the same situation?

error with own dataset

Hello,

I found an error in the last commit:

Traceback (most recent call last):
File "train.py", line 82, in
loss = model(imgs, targets)
File "/home/alupotto/anaconda3/envs/pt4_cu9_p35/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/alupotto/PycharmProjects/tracktorch/models.py", line 213, in forward
x, *losses = module[0](x, targets)
File "/home/alupotto/anaconda3/envs/pt4_cu9_p35/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/alupotto/PycharmProjects/tracktorch/models.py", line 177, in forward
return loss, loss_x.item(), loss_y.item(), loss_w.item(), loss_h.item(), loss_conf.item(), loss_cls.item(), float(nCorrect/nGT)
ZeroDivisionError: division by zero

I am training with my own data/labels, and some images have no ground truth at all, so the .txt file exists but is empty.

I just added a check before returning float(nCorrect/nGT) in models.py, like this:

AP = float(nCorrect/nGT) if nGT != 0 else 0, and then I return AP.

I mention it in case somebody has the same problem or for future updates.

About the train and detector

In detect.py (line 109):
for x1, y1, x2, y2, conf, cls_conf, cls_pred in detections:
Should this be:
for x, y, w, h, conf, cls_conf, cls_pred in detections:
because in the YOLO layer, pred_boxes is [x, y, w, h]?

training error

[Epoch 0/30, Batch 0/14658] [Losses: x 0.155323, y 0.153960, w 1.839854, h 1.919111, conf 2.058636, cls 2.145512, total 8.272396, recall: 0.00000]
Traceback (most recent call last):
File "/mnt/diskb/even/yolov3_pytorch/train.py", line 100, in
model.seen += imgs.size(0)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 532, in getattr
type(self).name, name))
AttributeError: 'Darknet' object has no attribute 'seen'

Negative values for some coordinates

What is the reason for negative values in some coordinates when converting from
(center x, center y, width, height) to (x1, y1, x2, y2) in the non_max_suppression function?

For example, the values for x1 coordinates are calculated as:
box_corner[:, :, 0] = prediction[:, :, 0] - prediction[:, :, 2] / 2 and there are some negative values here.

box_corner[:, :, 0] <= 0 boils down to sigmoid(tx) + grid_x <= anchor_w * exp(tw) / 2.

How can this happen?

Thanks!
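
A tiny numeric example of how this can happen (my reading, not the author's): a wide box predicted in the left-most grid cell simply extends past the image border, so x1 goes negative.

    import math

    grid_x, tx, tw = 0, 0.0, 0.0
    anchor_w = 3.625                        # e.g. the 116-pixel anchor at stride 32: 116 / 32
    cx = 1 / (1 + math.exp(-tx)) + grid_x   # sigmoid(tx) + grid_x = 0.5
    w = anchor_w * math.exp(tw)             # 3.625 grid cells wide
    x1 = cx - w / 2                         # 0.5 - 1.8125 = -1.3125, i.e. past the left border
    print(x1)

In (x1, y1, x2, y2) form such boxes are usually just clipped to the image when drawn.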

Clarification about AP calculation

The average precision calculation, as done in test.py and compute_ap(), differentiates between a correctness vector of, say, [1,1,1,0] and [1,0,1,1]; that is, the order of the predictions matters.

Is this necessary for object detection? If yes, why?
If I have an image with 3 ground truths, and the detection vector contains 2 true positives and 1 false positive, why does it matter if I find:
true positive, true positive, false positive -> [1,1,0] vs
true positive, false positive, true positive -> [1,0,1]?
The same predictions are made, but they yield different APs. Is that fair?

Edit: I am not saying the algorithm is wrong, but some things are fuzzy for me.

Thanks!
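
A minimal sketch (not the repo's exact compute_ap) of why the order matters: precision is accumulated along the ranked detections, so the same true/false positives in a different order trace a different precision-recall curve.

    import numpy as np

    def average_precision(correct, n_gt):
        correct = np.asarray(correct, dtype=float)
        tp = np.cumsum(correct)              # true positives seen so far
        fp = np.cumsum(1 - correct)          # false positives seen so far
        recall = tp / n_gt
        precision = tp / (tp + fp)
        ap, prev_recall = 0.0, 0.0
        for p, r in zip(precision, recall):  # rectangular area under the PR curve
            ap += p * (r - prev_recall)
            prev_recall = r
        return ap

    print(average_precision([1, 1, 0], n_gt=3))   # ~0.667
    print(average_precision([1, 0, 1], n_gt=3))   # ~0.556

The usual justification is that detections are ranked by confidence, so AP rewards ranking true positives ahead of false positives.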

What's the meaning of tw and th in your loss function?

When you calculate the loss, the code for tw and th looks like this:

# Width and height
tw[b, best_n, gj, gi] = math.log(gw/anchors[best_n][0] + 1e-16)
th[b, best_n, gj, gi] = math.log(gh/anchors[best_n][1] + 1e-16)

I don't understand why you use log; in the paper, the loss for w and h is:

[image in the original issue: the paper's loss terms for w and h]

And why do loss_x and loss_y use BCELoss? It's a regression problem; I think MSELoss would be more applicable.
Can you explain? Thanks.
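
For context, my reading of the paper rather than a claim about the author's intent: the network predicts t_w and t_h, and boxes are decoded as b_w = p_w * exp(t_w), b_h = p_h * exp(t_h). Inverting that relation gives the regression targets, which is where the log comes from:

    \hat{t}_w = \log\frac{g_w}{p_w}, \qquad \hat{t}_h = \log\frac{g_h}{p_h}

where g is the ground-truth size, p is the anchor (prior) size, and the 1e-16 is only a numerical safeguard against log(0).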

About the anchors

I have two questions about the anchors used in the config file of YOLOv3:

1: The smallest anchors are used at the end and the biggest at the beginning. Why is that, and shouldn't it be the other way around?

2: What format are the anchors in? Are they based on the size of the input, so that if I compute relative anchor sizes (between 0 and 1) and my input size is 416x416, I just multiply the relative sizes by 416 and write those values in the config file?
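
My understanding of the usual Darknet convention, offered as a hedged note rather than a statement about this repo: cfg anchors are in pixels of the network input (416 here); the first [yolo] layer is the coarsest 13x13 grid and its mask assigns it the largest anchors; and the code divides each layer's anchors by that layer's stride.

    # illustrative only: the three largest yolov3 anchors, converted to grid units
    anchors = [(116, 90), (156, 198), (373, 326)]     # pixels at a 416x416 input
    stride = 32                                       # 416 / 13
    scaled_anchors = [(w / stride, h / stride) for w, h in anchors]
    print(scaled_anchors)                             # [(3.625, 2.8125), (4.875, 6.1875), (11.65625, 10.1875)]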

Error in training

[Epoch 0/30, Batch 1830/1833] [Losses: x 0.018887, y 0.017350, w 0.011480, h 0.009068, conf 0.032589, cls 0.145579, total 0.234953, recall: 0.65000]
[Epoch 0/30, Batch 1830/1833] [Losses: x 0.027525, y 0.027824, w 0.010730, h 0.007892, conf 0.054383, cls 0.179986, total 0.308339, recall: 0.81111]
[Epoch 0/30, Batch 1830/1833] [Losses: x 0.012489, y 0.012089, w 0.010546, h 0.007774, conf 0.033698, cls 0.192683, total 0.269278, recall: 0.54545]
[Epoch 0/30, Batch 1830/1833] [Losses: x 0.042511, y 0.042000, w 0.023847, h 0.016442, conf 0.075596, cls 0.151306, total 0.351703, recall: 0.67361]
[Epoch 0/30, Batch 1830/1833] [Losses: x 0.055068, y 0.053277, w 0.060733, h 0.038808, conf 0.099795, cls 0.244854, total 0.552534, recall: 0.65027]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.034622, y 0.033223, w 0.014849, h 0.012658, conf 0.059653, cls 0.104131, total 0.259136, recall: 0.67568]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.038405, y 0.035833, w 0.044459, h 0.039020, conf 0.054494, cls 0.141215, total 0.353426, recall: 0.52899]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.015338, y 0.015971, w 0.013100, h 0.005587, conf 0.035706, cls 0.174440, total 0.260142, recall: 0.52381]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.023167, y 0.022108, w 0.033820, h 0.017704, conf 0.047694, cls 0.193106, total 0.337599, recall: 0.50617]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.014519, y 0.014234, w 0.011042, h 0.005328, conf 0.033018, cls 0.121948, total 0.200090, recall: 0.59259]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.022603, y 0.022197, w 0.011902, h 0.008511, conf 0.041803, cls 0.088609, total 0.195626, recall: 0.61111]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.016306, y 0.014795, w 0.009144, h 0.004840, conf 0.039758, cls 0.154830, total 0.239672, recall: 0.62963]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.021347, y 0.020986, w 0.019330, h 0.012508, conf 0.042181, cls 0.147103, total 0.263455, recall: 0.60870]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.040201, y 0.038973, w 0.015251, h 0.036656, conf 0.073650, cls 0.174954, total 0.379684, recall: 0.54815]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.026244, y 0.026605, w 0.030584, h 0.021944, conf 0.048804, cls 0.160845, total 0.315027, recall: 0.53846]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.010117, y 0.009941, w 0.004098, h 0.005762, conf 0.027249, cls 0.178949, total 0.236117, recall: 0.60000]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.037965, y 0.038500, w 0.014596, h 0.019955, conf 0.056791, cls 0.110870, total 0.278677, recall: 0.64912]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.013917, y 0.013995, w 0.008014, h 0.005817, conf 0.034943, cls 0.167149, total 0.243835, recall: 0.47222]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.027614, y 0.026907, w 0.021914, h 0.014159, conf 0.062784, cls 0.209417, total 0.362795, recall: 0.61111]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.035124, y 0.033940, w 0.018896, h 0.013856, conf 0.060784, cls 0.174728, total 0.337329, recall: 0.71053]
[Epoch 0/30, Batch 1831/1833] [Losses: x 0.023849, y 0.023081, w 0.040256, h 0.029740, conf 0.047857, cls 0.191115, total 0.355898, recall: 0.53333]
[Epoch 0/30, Batch 1832/1833] [Losses: x 0.061831, y 0.061271, w 0.036845, h 0.032102, conf 0.117929, cls 0.158679, total 0.468657, recall: 0.66667]
[Epoch 0/30, Batch 1832/1833] [Losses: x 0.048841, y 0.048758, w 0.083654, h 0.054965, conf 0.089770, cls 0.146247, total 0.472235, recall: 0.59259]
[Epoch 0/30, Batch 1832/1833] [Losses: x 0.036920, y 0.038865, w 0.014582, h 0.023073, conf 0.077787, cls 0.183660, total 0.374887, recall: 0.60000]
[Epoch 0/30, Batch 1832/1833] [Losses: x 0.031886, y 0.031555, w 0.011043, h 0.014374, conf 0.057977, cls 0.117786, total 0.264621, recall: 0.60000]
Traceback (most recent call last):
File "train.py", line 88, in
loss = model(img, target)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/wang/jiayunpeng/jpy-yolo/PyTorch-YOLOv3-master/models.py", line 194, in forward
x = module(x)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: input has less dimensions than expected

about backward

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

About training

After running 8 epochs on my 1080 Ti (about 6 hours), the checkpoints folder contains the files [0..7].weights. I run "python detect.py --weights_path checkpoints/7.weights" to detect, but there are no detections. If I download the weights from "https://pjreddie.com/media/files/yolov3.weights", detection works, so I suspect something is wrong with the training, or that I have to wait for all 30 epochs.

Possible bugs in YOLO Layer's forward

  1. The no-object component of the confidence loss:

self.lambda_noobj * self.bce_loss(conf * (1 - mask), mask * (1 - mask))

basically compares conf * (1 - mask) with mask * (1 - mask), but mask * (1 - mask) is always a tensor of zeros. I think it should be:

self.lambda_noobj * self.bce_loss(conf * (1 - mask), tconf * (1 - mask))
  2. The AP calculation has two problems:
  • isn't nCorrect / nGT actually the recall? Precision is the number of correct predictions divided by the number of all predictions.
  • if there are no ground truths, nCorrect / nGT will be NaN, so the correct expression should be:
    1 if (nCorrect == nGT == 0) else (nCorrect / nGT)

No detection after overfitting my own dataset.

I trained this model for detection of hands. To first check the model, I trained it on 32 images for 60 epochs, and this is what I get after 60 epochs:

[training-log screenshot in the original issue]

But when I run detect.py on the same dataset that I overfitted on, there are no bounding boxes.
I changed the configuration, i.e. the number of classes from 80 to 1 and the filters from 255 to 18, and also the coco.data and coco.names files.
The labels are as follows:
[class, x_center, y_center, width, height]
for example [0, 0.4, 0.3, 0.2, 0.1]

When I print the detections in detect.py, they are [nan, nan, nan, ...]

So there are no results in the output.

Computing confidence mask

In the build_targets function, at the beginning, there is a part that calculates the confidence mask tensor.
Initially it is set to a tensor of ones, but the update rule:

    # Objects with higher confidence than threshold are set to zero
    conf_mask[b][cur_ious.view_as(conf_mask[b]) > ignore_thres] = 0

doesn't make sense to me. This basically ignores any IoUs better than ignore_thres (currently set to 0.5).

I'd think that we should:

  • either start with a tensor of zeros and use the update rule:
    conf_mask[b][cur_ious.view_as(conf_mask[b]) > thres] = 1
  • or change the sign: conf_mask[b][cur_ious.view_as(conf_mask[b]) <= thres] = 0

Thanks

no result

I trained on my own dataset. Now I use detect.py to detect on an image, but I only get one big white bounding box around the whole image and nothing else. Can anyone help? Thanks.

About loading the labels

Why do you do this in datasets.py?

    x1 = w * (labels[:, 1] - labels[:, 3]/2)
    y1 = h * (labels[:, 2] - labels[:, 4]/2)
    x2 = w * (labels[:, 1] + labels[:, 3]/2)
    y2 = h * (labels[:, 2] + labels[:, 4]/2)
    # Adjust for added padding
    x1 += pad[1][0]
    y1 += pad[0][0]
    x2 += pad[1][0]
    y2 += pad[0][0]
    # Calculate ratios from coordinates
    #print labels, float(h) / float(padded_h), h, padded_h, w, padded_w
    '''
    labels[:, 1] = ((x1 + x2) / 2) / padded_w
    labels[:, 2] = ((y1 + y2) / 2) / padded_h
    labels[:, 3] *= w / padded_w
    labels[:, 4] *= h / padded_h
    '''
    labels[:, 3] *= float(w) / float(padded_w)
    labels[:, 4] *= float(h) / float(padded_h)
    labels[:, 1] = ((x1 + x2) / 2) / float(padded_w)
    labels[:, 2] = ((y1 + y2) / 2) / float(padded_h)

a bug in datasets.py

Change:

    labels[:, 3] *= w / padded_w
    labels[:, 4] *= h / padded_h

to:

    labels[:, 3] *= float(w) / padded_w
    labels[:, 4] *= float(h) / padded_h

training error

When I run train.py, I get an error:
Traceback (most recent call last):
File "/home/lc/Downloads/PyTorch-YOLOv3/train.py", line 88, in
loss += model(sub_imgs, sub_targets)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/lc/Downloads/PyTorch-YOLOv3/models.py", line 214, in forward
x = module[0](x, targets)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/lc/Downloads/PyTorch-YOLOv3/models.py", line 175, in forward
loss_cls = self.class_scale * self.ce_loss(pred_cls, tcls)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/loss.py", line 759, in forward
self.ignore_index, self.reduce)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1442, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, size_average, ignore_index, reduce)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 944, in log_softmax
return torch._C._nn.log_softmax(input, dim)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)
The error occurs in this part of forward:

    def forward(self, x, targets=None):
        ...
            elif module_def['type'] == 'yolo':
                x = module[0](x, targets)
                # If predictions: concatenate / if loss: add to total loss
                output = output + [x] if targets is None else output + x

about training loss

When I train on COCO or on my own data,
the loss does not converge, especially for w and h:

371: nGT 19, recall 13, AP 68.42% proposals 0, loss: x 3.118402, y 3.105752, w nan, h nan, conf 164.831421, cls 83.258507, total nan
372: nGT 42, recall 11, AP 26.19% proposals 0, loss: x 3.763822, y 4.745901, w nan, h nan, conf 174.948318, cls 122.696747, total nan
372: nGT 42, recall 21, AP 50.00% proposals 0, loss: x 6.076531, y 5.691019, w nan, h nan, conf 317.763611, cls 157.752930, total nan
372: nGT 42, recall 34, AP 80.95% proposals 0, loss: x 6.518974, y 5.134117, w nan, h nan, conf 478.700623, cls 175.281006, total nan
373: nGT 7, recall 5, AP 71.43% proposals 0, loss: x 0.831325, y 1.967510, w nan, h nan, conf 55.965015, cls 30.674187, total nan
373: nGT 7, recall 7, AP 100.00% proposals 0, loss: x 1.058911, y 1.067211, w nan, h nan, conf 101.454453, cls 30.674187, total nan
373: nGT 7, recall 7, AP 100.00% proposals 0, loss: x 0.877475, y 0.707009, w nan, h nan, conf 135.526093, cls 30.674187, total nan
374: nGT 29, recall 7, AP 24.14% proposals 0, loss: x 4.089700, y 4.640538, w nan, h nan, conf 129.530502, cls 113.932693, total nan
374: nGT 29, recall 12, AP 41.38% proposals 0, loss: x 3.668809, y 6.106874, w nan, h nan, conf 198.104630, cls 127.078773, total nan
374: nGT 29, recall 21, AP 72.41% proposals 0, loss: x 4.712764, y 5.478906, w nan, h nan, conf 295.735779, cls 127.078773, total nan
375: nGT 15, recall 4, AP 26.67% proposals 0, loss: x 2.289522, y 1.391438, w nan, h nan, conf 64.979248, cls 48.202293, total nan
375: nGT 15, recall 7, AP 46.67% proposals 0, loss: x 2.598335, y 2.003327, w nan, h nan, conf 118.649559, cls 61.348373, total nan
375: nGT 15, recall 11, AP 73.33% proposals 0, loss: x 3.218783, y 3.078263, w nan, h nan, conf 178.149246, cls 65.730400, total nan
376: nGT 14, recall 2, AP 14.29% proposals 0, loss: x 1.407205, y 2.768980, w nan, h nan, conf 46.998051, cls 48.202293, total nan
376: nGT 14, recall 6, AP 42.86% proposals 0, loss: x 1.456108, y 3.255125, w nan, h nan, conf 91.687042, cls 56.966347, total nan
376: nGT 14, recall 8, AP 57.14% proposals 0, loss: x 2.827966, y 2.196316, w nan, h nan, conf 131.248718, cls 61.348373, total nan
377: nGT 10, recall 7, AP 70.00% proposals 0, loss: x 0.662892, y 1.921333, w nan, h nan, conf 107.326248, cls 43.820267, total nan
377: nGT 10, recall 9, AP 90.00% proposals 0, loss: x 0.840957, y 1.428768, w nan, h nan, conf 140.404282, cls 43.820267, total nan
377: nGT 10, recall 10, AP 100.00% proposals 0, loss: x 0.755548, y 1.235753, w nan, h nan, conf 174.077911, cls 43.820267, total nan

CrossEntropyLoss after Sigmoid on class predictions

I think the default torch.nn.CrossEntropyLoss(size_average=False) loss used between predicted and true classes is not the correct choice, and here's why:

Class predictions are passed through a sigmoid: pred_cls = torch.sigmoid(prediction[:, :, 5:]), so the values are in [0, 1].
In the perfect case (true class = 1, other classes = 0), the difference in magnitude is (and will always be at most) 1. Taking the log-softmax (what CrossEntropyLoss does) of a vector with values in [0, 1] doesn't help the loss clearly identify the correct class, again because the relative differences are small.

In my understanding, CrossEntropyLoss measures the relative difference between the truth class and other classes (i.e.: a good classification = large value for truth class and small values for other classes). For example: [0.1, 2.1, 23.1, 0.7] is a good prediction for class [3], the relative difference between the 3rd element and the rest is big. Please correct me if I am wrong.

Maybe we can just use MSELoss for classes, too?
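
A small, self-contained demo of the point above (illustrative values, not the repo's tensors): after a sigmoid, the inputs to CrossEntropyLoss all lie in [0, 1], so its internal softmax is nearly uniform and the loss barely rewards the correct class.

    import torch
    import torch.nn as nn

    ce = nn.CrossEntropyLoss()
    target = torch.tensor([2])                       # "true" class index
    raw = torch.tensor([[0.1, 2.1, 23.1, 0.7]])      # confident raw scores
    squashed = torch.sigmoid(raw)                    # everything squeezed into [0, 1]

    print(ce(raw, target).item())        # ~0.0 (confident prediction, low loss)
    print(ce(squashed, target).item())   # ~1.17, close to log(4) ~ 1.39 for a uniform guess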

Question regarding build_targets

I found that conf_mask[b, anch_ious > ignore_thres] = 0 will overwrite any previous ground truth target that has conf_mask set to 1.

    for t in range(target.shape[1]):
        if target[b, t].sum() == 0:
            continue
        .......
        # Calculate iou between gt and anchor shapes
        anch_ious = bbox_iou(gt_box, anchor_shapes)
        # Where the overlap is larger than threshold set mask to zero (ignore)
        conf_mask[b, anch_ious > ignore_thres] = 0    <---------------------
        # Find the best matching anchor box
        best_n = np.argmax(anch_ious)
        ...
        # Masks
        mask[b, best_n, gj, gi] = 1
        conf_mask[b, best_n, gj, gi] = 1    <-------------------

Let's say:

Iter1: anch_ious = [0.567 0.305 0.43], anch_ious>ignore_thres=0,best_n=0, conf_mask[b, 0, gj, gi]= 1
Iter2: anch_ious = [0.667 0.045 0.33], anch_ious>ignore_thres=0,best_n=0, conf_mask[b, 0, gj, gi]= 1
Iter3: anch_ious = [0.734 0.025 0.22], anch_ious>ignore_thres=0,best_n=0, conf_mask[b, 0, gj, gi]= 1

So every iteration erases the previous iteration's ground-truth conf_mask if the targets select the same anchor. For the example above, only Iter3's conf_mask[b, 0, gj, gi] = 1 is kept. It seems like more than half of the ground-truth targets are usually ignored during training. Is this intended behavior, and why would it still work for training?

how to retrain on custom dataset

Hi,

I have a custom dataset with bounding box annotations, and I want to retrain YOLO-tiny. How can this be done with this minimal version?
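
A hedged outline of the usual Darknet-style setup for a custom dataset; the file names and paths below are assumptions following common convention, not instructions taken from this repo.

    # custom.names: one class name per line, e.g. a single line "hand"

    # custom.data (paths are examples):
    classes=1
    train=data/custom/train.txt
    valid=data/custom/valid.txt
    names=data/custom/custom.names

    # yolov3-tiny.cfg: in every [yolo] block set classes=1, and in the
    # [convolutional] block directly above it set filters=(classes + 5) * 3 = 18

Labels follow the format already shown in the issues above: one .txt per image with "class x_center y_center width height", normalized to [0, 1].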

core dumped

Core was generated by `python train.py'. Program terminated with signal SIGSEGV, Segmentation fault.
Core dumped!!!

Issue when detecting with own weights

I trained the model on my own dataset, and got weights from that. When I want to detect objects using these weights, I get the following error:

Traceback (most recent call last):
  File "detect_OP.py", line 42, in <module>
    model.load_weights(opt.weights_path)
  File "/home/robzelluf/Desktop/PyTorch-YOLOv3/models.py", line 265, in load_weights
    conv_w = torch.from_numpy(weights[ptr:ptr + num_w]).view_as(conv_layer.weight)
  File "/home/robzelluf/.local/lib/python3.5/site-packages/torch/tensor.py", line 230, in view_as
    return self.view(tensor.size())
RuntimeError: invalid argument 2: size '[1024 x 512 x 3 x 3]' is invalid for input with 3837339 elements at /pytorch/aten/src/TH/THStorage.c:41

Can anyone help me with this?

Training error

Hi, I want to retrain on the COCO dataset.

I get the following error but I don't know how to solve it:

Traceback (most recent call last):
  File "train.py", line 81, in <module>
    for batch_i, (_, imgs, targets) in enumerate(dataloader):
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/kdy/yolov3/utils/datasets.py", line 71, in __getitem__
    h, w, _ = img.shape
ValueError: not enough values to unpack (expected 3, got 0)

Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f19fb602d30>>
Traceback (most recent call last):
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
    self._shutdown_workers()
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
    self.worker_result_queue.get()
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/home/kdy/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
