additive-margin-softmax's Issues

Accuracy on LFW cannot be improved beyond 0.9

Hello, I have tried several combinations of parameters while running train.py, such as epoch_size = 200, max_epoch_num = 500, and keep_probability = 0.9, with weight_decay and batch_size set to 5e-4 and 256 as suggested. But the accuracy on the LFW dataset stays around 0.9, so I am hoping for some advice on the training parameters.
What's more, I wonder whether it is because I have not used any pre-trained model. I would be grateful for a reply soon.

Some problems about align_lfw.py

As mentioned in the README.md, I should align both the training set and LFW. Does this mean that I should run align_lfw.py twice? My settings are as follows:
Step1 --"python align_lfw.py --input-dir /DataSet/lfwdataset/lfw-deepfunneled --output-dir /home/myname/lfwdataset"
Step2 --"python train.py --data_dir /home/myname/lfwdataset --random_flip --learning_rate -1 --learning_rate_schedule_file ./data/learning_rate_AM_softmax.txt --lfw_dir /home/myrname/lfwdataset --keep_probability 0.8 --weight_decay 5e-4"
Are there any problems with this? Hoping for your reply soon, thank you~

can't load saved model

When the arg 'pretrained_model' is set to the path of a saved model, it complains with these errors:

Caused by op u'save/RestoreV2_54', defined at:
File "train.py", line 528, in
main(parse_arguments(sys.argv[1:]))
File "train.py", line 162, in main
saver_load = tf.train.Saver(tf.trainable_variables(), max_to_keep=1)
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1218, in init
self.build()
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1227, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1263, in _build
build_save=build_save, build_restore=build_restore)
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 751, in _build_internal
restore_sequentially, reshape)
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 427, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 267, in restore_op
[spec.tensor.dtype])[0])
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1021, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/td/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./trained_model/20180409-183909/
[[Node: save/RestoreV2_54 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_54/tensor_names, save/RestoreV2_54/shape_and_slices)]]
[[Node: save/RestoreV2_7/_497 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_490_save/RestoreV2_7", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
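
The "Failed to find any matching files" part of this NotFoundError usually means --pretrained_model points at the model directory itself rather than at a checkpoint prefix inside it. A minimal sketch, assuming standard tf.train.Saver checkpoints written by train.py (the directory name is just the one from the error above, and the dummy variable stands in for the real network graph built before restoring):

    import tensorflow as tf

    # Hedged sketch, not the repo's exact code: resolve the checkpoint *prefix*
    # with tf.train.latest_checkpoint and restore from that, instead of passing
    # the bare directory (which triggers "Failed to find any matching files").
    model_dir = './trained_model/20180409-183909/'  # directory from the error above

    # Stand-in for the real variables that train.py builds before restoring.
    dummy = tf.get_variable('dummy_weight', shape=[3], initializer=tf.zeros_initializer())

    saver = tf.train.Saver(tf.trainable_variables(), max_to_keep=1)
    ckpt = tf.train.latest_checkpoint(model_dir)  # returns a checkpoint prefix, or None
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        if ckpt is not None:
            saver.restore(sess, ckpt)
        else:
            print('No checkpoint files found under %s' % model_dir)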

Pre-trained model

Dear @Joker316701882 ,

Could you share the TensorFlow pre-trained model so we can download and play with it? For example, the model trained on VGGFace2.
Thank you very much. Btw, what a great repository!

Error when running python train.py and testing with the LFW dataset

Traceback (most recent call last):
File "train.py", line 542, in
main(parse_arguments(sys.argv[1:]))
File "train.py", line 213, in main
label_batch, lfw_paths, actual_issame, args.lfw_batch_size, args.lfw_nrof_folds, log_dir, step, summary_writer,best_accuracy,saver_save,model_dir,subdir)
File "train.py", line 313, in evaluate
_, _, accuracy, val, val_std, far = lfw.evaluate(emb_array, actual_issame, nrof_folds=nrof_folds)
File "/home/yjzx/Downloads/Additive-Margin-Softmax/lfw.py", line 43, in evaluate
np.asarray(actual_issame), 1e-3, nrof_folds=nrof_folds)
File "/home/yjzx/Downloads/Additive-Margin-Softmax/facenet.py", line 566, in calculate_val
val[fold_idx], far[fold_idx] = calculate_val_far(threshold, dist[test_set], actual_issame[test_set])
File "/home/yjzx/Downloads/Additive-Margin-Softmax/facenet.py", line 580, in calculate_val_far
val = float(true_accept) / float(n_same)
far = float(false_accept) / float(n_diff)
ZeroDivisionError: float division by zero
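
This ZeroDivisionError means n_same or n_diff is zero in calculate_val_far, i.e. the evaluation was handed an empty (or single-class) set of LFW pairs; it is worth checking that --lfw_dir and --lfw_pairs point at the aligned LFW images. A hedged sketch of a guarded version of the function (close to, but not necessarily identical to, the facenet.py code in the traceback above):

    import numpy as np

    def calculate_val_far(threshold, dist, actual_issame):
        # Guarded variant: returns 0.0 instead of dividing by zero when there are
        # no genuine (n_same) or no impostor (n_diff) pairs in this fold.
        predict_issame = np.less(dist, threshold)
        true_accept = np.sum(np.logical_and(predict_issame, actual_issame))
        false_accept = np.sum(np.logical_and(predict_issame, np.logical_not(actual_issame)))
        n_same = np.sum(actual_issame)
        n_diff = np.sum(np.logical_not(actual_issame))
        val = float(true_accept) / float(n_same) if n_same > 0 else 0.0
        far = float(false_accept) / float(n_diff) if n_diff > 0 else 0.0
        return val, far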

about cos_theta

Hi:
I have a question: in order to get cos_theta, tf.matmul is used here. When I train a net, "nan" shows up in the log, so I am confused. tf.matmul gives a matrix result, while tf.multiply gives a scalar value.
Can I use tf.multiply here?
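
For reference, a minimal sketch of how cos_theta is typically formed in AM-Softmax (not necessarily the repo's exact code): with the embeddings and the class-weight columns both L2-normalized, one tf.matmul produces cos(theta) for every (sample, class) pair at once, which the softmax over all classes needs; tf.multiply is element-wise and would not give that matrix. NaNs in the log are usually a separate issue (e.g. learning rate or a missing epsilon in the normalization) rather than the matmul itself.

    import tensorflow as tf

    batch_size, embedding_size, nrof_classes = 32, 512, 10000  # illustrative sizes

    embeddings = tf.placeholder(tf.float32, [batch_size, embedding_size])
    weights = tf.get_variable('am_softmax_weights', [embedding_size, nrof_classes],
                              initializer=tf.contrib.layers.xavier_initializer())

    # L2-normalize features (rows) and class weights (columns); one matmul then
    # gives the full [batch_size, nrof_classes] matrix of cosine similarities.
    embeddings_norm = tf.nn.l2_normalize(embeddings, 1, 1e-10)
    weights_norm = tf.nn.l2_normalize(weights, 0, 1e-10)
    cos_theta = tf.matmul(embeddings_norm, weights_norm)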

Validation on LFW

Hi, @Joker316701882. I use CASIA-WebFace as the training set and the LFW dataset as the validation set. However, during training the accuracy on LFW is always 50% and the selected threshold is always 0. To find the problem, I read the source code and found that the Euclidean distance is used to measure the distance between the two embedded features. As far as I know, AM-Softmax is an angular-margin-based face recognition algorithm. My questions are:

  1. Is it OK to just use a Euclidean-based distance to evaluate? (See the sketch after this list.)
  2. If evaluating with a Euclidean-based distance is logically sound, what causes the fixed accuracy? Is there anything I forgot to configure before training?
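
On question 1: for L2-normalized embeddings, squared Euclidean distance and cosine similarity are related by a fixed monotone transform, so thresholding one is equivalent to thresholding the other, and a Euclidean-based evaluation is consistent with an angular loss. A small numpy check of the identity ||a - b||^2 = 2 - 2*cos(a, b):

    import numpy as np

    rng = np.random.RandomState(0)
    a = rng.randn(512)
    a /= np.linalg.norm(a)          # unit-length embedding a
    b = rng.randn(512)
    b /= np.linalg.norm(b)          # unit-length embedding b

    squared_dist = np.sum((a - b) ** 2)
    cosine = np.dot(a, b)
    print(squared_dist, 2.0 - 2.0 * cosine)   # the two printed values agree

A fixed 50% accuracy with a threshold of 0 therefore points elsewhere, e.g. embeddings collapsing to near-identical vectors or the LFW pairs not loading correctly, rather than the distance metric itself.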

How can I use the function AM_logits_compute?

I tried to use your resface.py as the inference network while using FaceNet as the training logic.
But I am not sure how to use the function AM_logits_compute: neither using it to replace prelogits_center_loss nor using it to replace the 'logits' parameter of tf.nn.sparse_softmax_cross_entropy_with_logits worked.
Can you tell me how to use the function AM_logits_compute? Thanks!
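
The intended wiring is that the AM-Softmax logits replace the ordinary logits fed to tf.nn.sparse_softmax_cross_entropy_with_logits; they are not an auxiliary loss term like prelogits_center_loss. Below is a generic AM-Softmax head as a hedged sketch (the argument names and the margin m / scale s defaults are assumptions of this sketch, not necessarily the repo's; check AM_logits_compute for its actual signature):

    import tensorflow as tf

    def am_logits(embeddings, labels, nrof_classes, m=0.35, s=30.0):
        # Generic AM-Softmax head: scaled cosine logits with an additive margin m
        # subtracted from the target class only.
        embedding_size = embeddings.get_shape().as_list()[1]
        kernel = tf.get_variable('am_kernel', [embedding_size, nrof_classes],
                                 initializer=tf.contrib.layers.xavier_initializer())
        embeddings_norm = tf.nn.l2_normalize(embeddings, 1, 1e-10)
        kernel_norm = tf.nn.l2_normalize(kernel, 0, 1e-10)
        cos_theta = tf.matmul(embeddings_norm, kernel_norm)     # [batch, nrof_classes]
        one_hot = tf.one_hot(labels, nrof_classes)
        return s * (cos_theta - m * one_hot)

    # Usage sketch inside a FaceNet-style train.py:
    #   logits = am_logits(prelogits, label_batch, nrof_classes)
    #   cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    #       labels=label_batch, logits=logits)
    #   total_loss = tf.reduce_mean(cross_entropy)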

Training time suddenly increases at some point?

Epoch: [0][82/87866] Time 0.469 Loss 22.866 RegLoss 0.000000
Epoch: [0][83/87866] Time 0.468 Loss 22.917 RegLoss 0.000000
Epoch: [0][84/87866] Time 0.491 Loss 22.838 RegLoss 0.000000
Epoch: [0][85/87866] Time 0.453 Loss 22.843 RegLoss 0.000000
Epoch: [0][86/87866] Time 0.459 Loss 22.799 RegLoss 0.000000
Epoch: [0][87/87866] Time 0.459 Loss 22.761 RegLoss 0.000000
Epoch: [0][88/87866] Time 0.473 Loss 22.805 RegLoss 0.000000
Epoch: [0][89/87866] Time 0.466 Loss 22.893 RegLoss 0.000000
Epoch: [0][90/87866] Time 0.456 Loss 22.780 RegLoss 0.000000
Epoch: [0][91/87866] Time 0.457 Loss 22.721 RegLoss 0.000000
Epoch: [0][92/87866] Time 0.484 Loss 22.789 RegLoss 0.000000
Epoch: [0][93/87866] Time 0.465 Loss 22.911 RegLoss 0.000000
Epoch: [0][94/87866] Time 0.470 Loss 22.807 RegLoss 0.000000
Epoch: [0][95/87866] Time 0.483 Loss 22.808 RegLoss 0.000000
Epoch: [0][96/87866] Time 0.459 Loss 22.879 RegLoss 0.000000
Epoch: [0][97/87866] Time 0.459 Loss 22.673 RegLoss 0.000000
Epoch: [0][98/87866] Time 0.457 Loss 22.901 RegLoss 0.000000
Epoch: [0][99/87866] Time 0.494 Loss 22.780 RegLoss 0.000000
Epoch: [0][100/87866] Time 0.471 Loss 22.781 RegLoss 0.000000
Epoch: [0][101/87866] Time 0.458 Loss 22.840 RegLoss 0.000000
Epoch: [0][102/87866] Time 0.480 Loss 22.848 RegLoss 0.000000
Epoch: [0][103/87866] Time 0.470 Loss 22.788 RegLoss 0.000000
Epoch: [0][104/87866] Time 0.461 Loss 22.811 RegLoss 0.000000
Epoch: [0][105/87866] Time 0.471 Loss 22.855 RegLoss 0.000000
Epoch: [0][106/87866] Time 0.467 Loss 22.886 RegLoss 0.000000
Epoch: [0][107/87866] Time 0.453 Loss 22.856 RegLoss 0.000000
Epoch: [0][108/87866] Time 0.462 Loss 22.899 RegLoss 0.000000
Epoch: [0][109/87866] Time 0.490 Loss 22.719 RegLoss 0.000000
Epoch: [0][110/87866] Time 0.460 Loss 22.970 RegLoss 0.000000
Epoch: [0][111/87866] Time 0.457 Loss 22.761 RegLoss 0.000000
Epoch: [0][112/87866] Time 0.468 Loss 22.725 RegLoss 0.000000
Epoch: [0][113/87866] Time 0.479 Loss 22.768 RegLoss 0.000000
Epoch: [0][114/87866] Time 0.462 Loss 22.871 RegLoss 0.000000
Epoch: [0][115/87866] Time 0.456 Loss 22.947 RegLoss 0.000000
Epoch: [0][116/87866] Time 0.472 Loss 22.737 RegLoss 0.000000
Epoch: [0][117/87866] Time 0.472 Loss 22.790 RegLoss 0.000000
Epoch: [0][118/87866] Time 0.454 Loss 22.665 RegLoss 0.000000
Epoch: [0][119/87866] Time 0.453 Loss 22.742 RegLoss 0.000000
Epoch: [0][120/87866] Time 1.965 Loss 22.788 RegLoss 0.000000
Epoch: [0][121/87866] Time 1.992 Loss 22.809 RegLoss 0.000000
Epoch: [0][122/87866] Time 1.956 Loss 22.791 RegLoss 0.000000
Epoch: [0][123/87866] Time 2.076 Loss 22.890 RegLoss 0.000000
Epoch: [0][124/87866] Time 2.076 Loss 22.677 RegLoss 0.000000
Epoch: [0][125/87866] Time 2.029 Loss 22.981 RegLoss 0.000000
Epoch: [0][126/87866] Time 2.021 Loss 22.807 RegLoss 0.000000
Epoch: [0][127/87866] Time 1.953 Loss 22.661 RegLoss 0.000000
Epoch: [0][128/87866] Time 1.979 Loss 22.750 RegLoss 0.000000
Epoch: [0][129/87866] Time 2.029 Loss 22.915 RegLoss 0.000000

Bug when training with LFW

Additive-Margin-Softmax: training with LFW evaluation gives the following error:
$python train.py --data_dir out_data1 --random_flip --learning_rate -1 --learning_rate_schedule_file ./data/learning_rate_AM_softmax.txt --lfw_dir data/lfw --keep_probability 0.8 --weight_decay 5e-4

(error output attached as images in the original issue; not captured in this text export)
Thank you!
molyswu

Hi, another question

Hi, I use MS1M as the training dataset, but I can only get acc = 0.73. Alignment was done on the dataset, and the image size is 224.

Are there any tricks here?

ValueError on training

I'm getting the following error while training. How can I sort it out?

ValueError: Cannot have number of splits n_splits=10 greater than the number of samples: 0.
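
This ValueError comes from sklearn's KFold being handed zero samples, which in this code path means no LFW pairs were loaded at all, most often because --lfw_dir does not contain aligned images whose names match pairs.txt. A hedged sanity-check sketch (both paths below are placeholders):

    import os

    lfw_dir = '/path/to/aligned/lfw'   # placeholder for the --lfw_dir argument
    pairs_file = 'data/pairs.txt'      # placeholder for the --lfw_pairs argument

    found, missing = 0, 0
    with open(pairs_file) as f:
        next(f)                        # skip the "10 300" header line
        for line in f:
            p = line.strip().split()
            if len(p) < 3:
                continue
            # matched pair: name n1 n2; mismatched pair: name1 n1 name2 n2
            refs = [(p[0], p[1]), (p[0], p[2])] if len(p) == 3 else [(p[0], p[1]), (p[2], p[3])]
            for name, idx in refs:
                stem = os.path.join(lfw_dir, name, '%s_%04d' % (name, int(idx)))
                if os.path.exists(stem + '.png') or os.path.exists(stem + '.jpg'):
                    found += 1
                else:
                    missing += 1
    print('pair images found: %d, missing: %d' % (found, missing))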

about the bias initialization in insightface and resface

First of all, I appreciate your excellent work very much. What confuses me, though, is why you initialize the bias term of the last FC layer in insightface.py using xavier_init() while in resface.py it is initialized to 0 (see the illustration below).
Is there any explanation?

thanks a lot
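
For readers of the thread, an illustration only (not the repo's exact layers) of the two bias initializations being contrasted, written with tf.contrib.slim:

    import tensorflow as tf
    import tensorflow.contrib.slim as slim

    net = tf.placeholder(tf.float32, [None, 512])   # stand-in for the penultimate features

    # Bias initialized with Xavier (as the question says insightface.py does):
    fc_xavier = slim.fully_connected(net, 512, activation_fn=None, scope='fc_xavier',
                                     biases_initializer=tf.contrib.layers.xavier_initializer())

    # Bias initialized to zero (as the question says resface.py does):
    fc_zeros = slim.fully_connected(net, 512, activation_fn=None, scope='fc_zeros',
                                    biases_initializer=tf.zeros_initializer())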

TypeError: Fetch argument None has invalid type <class 'NoneType'>

What happened, and how do I solve it? Thanks.
I want to train this model on my own dataset. I only changed the dataset path, but the error is:

2018-11-22 17:13:28.944300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-11-22 17:13:29.488915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-22 17:13:29.488996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-11-22 17:13:29.489019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-11-22 17:13:29.489518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 18335 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:af:00.0, compute capability: 6.1)
WARNING:tensorflow:Error encountered when serializing regularization_losses.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'NoneType' object has no attribute 'name'
Running training
training a epoch...
Traceback (most recent call last):
File "train.py", line 519, in
main(parse_arguments(sys.argv[1:]))
File "train.py", line 182, in main
total_loss, train_op, summary_op, summary_writer, regularization_losses, args.learning_rate_schedule_file)
File "train.py", line 248, in train
err, _, step, reg_loss, summary_str = sess.run([loss, train_op, global_step, regularization_losses, summary_op], feed_dict=feed_dict)
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1125, in _run
self._graph, fetches, feed_dict_tensor, feed_handles=feed_handles)
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 427, in init
self._fetch_mapper = _FetchMapper.for_fetch(fetches)
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 245, in for_fetch
return _ListFetchMapper(fetch)
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 352, in init
self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 352, in
self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 245, in for_fetch
return _ListFetchMapper(fetch)
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 352, in init
self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 352, in
self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 242, in for_fetch
type(fetch)))
TypeError: Fetch argument None has invalid type <class 'NoneType'>
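
The earlier warning "Error encountered when serializing regularization_losses ... 'NoneType' object has no attribute 'name'" and this TypeError both point to a None entry inside the fetch list handed to sess.run, here among the collected regularization losses. A hedged sketch of a guard (variable names mirror train.py, but this is not the repo's exact fix):

    import tensorflow as tf

    # Drop any None entries from the regularization-loss collection and collapse
    # the rest into a single scalar, so every fetch passed to sess.run is a real
    # tensor even when the custom dataset/model registers no regularizers.
    reg_losses = [l for l in tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
                  if l is not None]
    regularization_losses = tf.add_n(reg_losses) if reg_losses else tf.constant(0.0)
    # ... then fetch `regularization_losses` in sess.run as before.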

Trouble with loading the checkpoint

Hello! @Joker316701882
I trained the resface20 model with the VGGFace2 dataset, and I want to restore the model checkpoint with the following code, but it throws an error.

    import tensorflow as tf
    # `network` is the repo's resface module; `sess` and `save_path` are defined
    # earlier in my loading script.
    image_batch = tf.placeholder(tf.float32, shape=(None, 112, 96, 3), name='input')
    prelogits, _ = network.inference(image_batch, 1.0, phase_train=False,
                                     bottleneck_layer_size=512, weight_decay=0.0)
    embeddings = tf.nn.l2_normalize(prelogits, 1, 1e-10, name='embeddings')
    saver = tf.train.Saver()
    saver.restore(sess, tf.train.latest_checkpoint(save_path))

The error is:

NotFoundError: Key Resface/Bottleneck/BatchNorm/moving_mean not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_133 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_138_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

I didn't change any code, and I trained the model just following the README.md instructions.

Please tell me how to deal with it. Thank you very much.
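
"Key Resface/Bottleneck/BatchNorm/moving_mean not found in checkpoint" means the rebuilt inference graph expects batch-norm moving statistics that the checkpoint simply does not contain (for example, a Saver built only from tf.trainable_variables() never saves the non-trainable moving_mean/moving_variance). A hedged diagnostic sketch that lists what a checkpoint actually holds (the path is a placeholder):

    import tensorflow as tf

    save_path = '/path/to/trained_model_dir'          # placeholder
    ckpt_path = tf.train.latest_checkpoint(save_path)
    if ckpt_path is None:
        print('no checkpoint found under %s' % save_path)
    else:
        reader = tf.train.NewCheckpointReader(ckpt_path)
        # Print every variable name stored in the checkpoint; compare with the
        # names the restored graph asks for (e.g. Resface/Bottleneck/BatchNorm/...).
        for name in sorted(reader.get_variable_to_shape_map()):
            print(name)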

About the Network backbone of the program

The ResNet-20 network used in this program does not contain the 7x7 convolution layer or the 3x3 max-pooling, or did I miss them somewhere? If you left them out on purpose, why?

error:'_5_batch_join/fifo_queue' is closed and has insufficient elements (requested 256, current size 0)

Hi, I ran python train.py with the LFW dataset and got the following error:
OutOfRangeError (see above for traceback): FIFOQueue '_5_batch_join/fifo_queue' is closed and has insufficient elements (requested 256, current size 0)
[[Node: batch_join = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch_join/fifo_queue, _arg_batch_size_0_0)]]

Thanks for your help
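
An empty, closed FIFOQueue at the very first batch usually means the enqueue threads died before producing anything, typically because --data_dir has no class sub-folders with images or some file fails to decode. A hedged sketch that walks the aligned dataset and reports unreadable images (the directory is a placeholder, and PIL is used here only as a generic decoder for the check):

    import os
    from PIL import Image

    data_dir = '/path/to/aligned/training/set'   # placeholder for --data_dir
    bad = []
    for root, _, files in os.walk(data_dir):
        for fname in files:
            if fname.lower().endswith(('.png', '.jpg', '.jpeg')):
                path = os.path.join(root, fname)
                try:
                    Image.open(path).load()      # force a full decode
                except Exception:
                    bad.append(path)
    print('unreadable images: %d' % len(bad))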

some question about the model

Hi, I want to know more about the news item from 2018/08/29.
You mentioned that the "latest experiment -- Resface20(bn) + vggface2 + weight_decay 5e-4 + batch_size 256 + momentum achieves 0.995+-0.003". I think Resface20 is the model you used, but what does "vggface2" mean here? Does it mean a pre-trained model?
Grateful for a reply soon, thanks~
