
wjun0830 / qd-detr

170 stars · 4 watchers · 13 forks · 1.26 GB

Official pytorch repository for "QD-DETR : Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023 Paper)

Home Page: https://arxiv.org/abs/2303.13874

License: Other

Python 97.94% Shell 2.06%
computer-vision moment-retrieval multi-modal video-highlight-detection video-retrieval video-summarization text-video-retrieval deep-learning detection-transformer

qd-detr's People

Contributors

wjun0830


qd-detr's Issues

About training on Charades.

Excuse me, I couldn't reproduce the results reported in the paper on the Charades dataset, even after setting the parameters according to issue #1. The C3D features used were obtained from https://drive.google.com/file/d/1CcMwae55Tuve_Ksrp5kONycyR1bVcX8D/view. Furthermore, for the SlowFast & CLIP features, I modified the code as follows:

# start_end_dataset.py
if self.dset_name == 'charades':
    model_inputs["saliency_pos_labels"], model_inputs["saliency_neg_labels"], model_inputs["saliency_all_labels"] = \
        self.get_saliency_labels_sub_as_query(meta["relevant_windows"][0], ctx_l)  # only one gt

and in eval.py, I modified the "mk_gt_scores" function:

def mk_gt_scores(gt_data, clip_length=1):
    """gt_data, dict, """
    # print("gt_data[duration]",gt_data["duration"])
    num_clips = int(gt_data["duration"] / clip_length)
    saliency_scores_full_video = np.zeros((num_clips, 3))
    relevant_clip_ids = np.arange(int(gt_data["relevant_windows"][0][0]), int(gt_data["relevant_windows"][0][1]))
    # FIXME
    saliency_scores_relevant_clips = np.ones((relevant_clip_ids.shape[0],3))  # (#relevant_clip_ids, 3)
    saliency_scores_full_video[relevant_clip_ids] = saliency_scores_relevant_clips
    return saliency_scores_full_video  # (#clips_in_video, 3)  the scores are in range [0, 4]

Actually, I found that if I don't modify the "mk_gt_scores" code and simply comment out the line pred_saliency_scores=saliency_scores[idx] in inference.py, it produces the same result.

cur_query_pred = dict(
    qid=meta["qid"],
    query=meta["query"],
    vid=meta["vid"],
    pred_relevant_windows=cur_ranked_preds,
    # pred_saliency_scores=saliency_scores[idx]
)

So, could you help me reproduce the results reported in the paper on the Charades dataset? Thanks.

The I3D features for Charades-STA

Hi, may I ask whether the Charades-STA experiments use the I3D features provided by VSLNet? If so, what is the value of your 'clip_len' parameter? If not, could you please provide the code for extracting the video features?

The implementation of rank-aware contrastive loss

Thank you for your open-source code.
I have a question about the code for the rank-aware contrastive loss mentioned in Section 3.4 of the paper:

This is the rank-aware contrastive loss formula listed in the paper:
[formula image from the paper]

This is the rank-aware contrastive loss computation in qd_detr/model.py:
# softmax
exp_logits = torch.exp(logits)
log_prob = logits - torch.log(exp_logits.sum(1, keepdim=True) + 1e-6)
mean_log_prob_pos = (pos_mask * log_prob * vid_token_mask).sum(1) / (pos_mask.sum(1) + 1e-6)
loss = - mean_log_prob_pos * batch_drop_mask
loss_rank_contrastive = loss_rank_contrastive + loss.mean()

I don't think the code implementation matches the formula; the bolded part of the code in particular confuses me.
If I have misunderstood something, please let me know. Thank you for your time.
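For reference, the quoted lines compute a log-softmax over the saliency logits of all video tokens and then average the resulting log-probabilities over the positive clips. Below is a minimal, self-contained sketch of that computation (the shapes and mask semantics are assumptions, and this is not the repository's exact code), which may make it easier to line the code up against the paper's equation:

import torch

def rank_contrastive_term_sketch(logits, pos_mask, vid_token_mask):
    """logits: (B, L) saliency scores; pos_mask / vid_token_mask: (B, L) with entries in {0, 1}."""
    # denominator of the softmax: sum of exp(logit) over every token
    exp_logits = torch.exp(logits)
    # log p_i = s_i - log(sum_j exp(s_j)); note that padded tokens are only
    # masked out in the numerator below, not in this denominator
    log_prob = logits - torch.log(exp_logits.sum(1, keepdim=True) + 1e-6)
    # average log-probability over the positive (and valid) clips of each sample
    mean_log_prob_pos = (pos_mask * log_prob * vid_token_mask).sum(1) / (pos_mask.sum(1) + 1e-6)
    return -mean_log_prob_pos.mean()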

Confusion about the code

Hello, I'm studying the code in your repository and I have a question about it. What does the variable "ctx_mode" (default "video_tef") indicate?
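For what it is worth, in the Moment-DETR-style dataset code that QD-DETR builds on, this flag is simply parsed into two booleans, so "video_tef" means "use the video features and also append temporal endpoint features (TEF)". Roughly (a sketch, not the repository's exact code):

ctx_mode = "video_tef"
use_video = "video" in ctx_mode  # feed the (SlowFast + CLIP) video features
use_tef = "tef" in ctx_mode      # append temporal endpoint features; see the "use_tef" issue below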

Inference

Hello, can I run the code in the run_on_video folder you provided on my own video, like Moment-DETR? If so, could you provide the best checkpoint from your training and validation process? Thanks!

About the description of the results file.

Hello,
Thank you for your interesting work. I have some questions about your code and paper. Does the result in Table 1 of the paper correspond to results/video_checkpoint_best_hl_val_preds_metrics.json in the "results" folder?

Best,
Eden

Question Regarding Feature Extraction Discrepancy Between Training & Inference

Hello, congratulations and thank you for sharing this awesome project!

I am cross posting a question from the Moment-DETR repo: jayleicn/moment_detr#26

In the paper and training code, it seems that both SlowFast and CLIP video features are used. But during inference time, it looks like only CLIP features are being used (based on the run.py file).

Am I understanding this correctly? If yes, what is the cause for this discrepancy?

Training on Charades-STA dataset with VGG backbone

Sorry for bothering you. When I train the model on the Charades-STA dataset with the VGG backbone, I follow one of the issues and set clip_len to 0.1666. However, this easily leads to the situation where neg_pool is empty and no negative index can be sampled.


So, I wonder whether there is any solution to this. Thanks!
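For context, the empty negative pool usually arises because, with a very small clip_len, the rescaled ground-truth window can cover the entire clip range, leaving no clips outside it from which to sample saliency negatives. One possible workaround, sketched below under the assumption that the negative pool is built as "all clip indices outside the GT window" (the helper name is hypothetical), is to fall back gracefully when the pool is empty:

import random

def sample_saliency_negatives_sketch(gt_st, gt_ed, ctx_l, max_n=2):
    """Hypothetical helper: negatives are clips outside the GT window [gt_st, gt_ed]."""
    neg_pool = list(range(0, gt_st)) + list(range(gt_ed + 1, ctx_l))
    if len(neg_pool) == 0:
        # The GT window spans the whole (rescaled) video: either skip the
        # saliency loss for this sample or, as done here, reuse positive clips.
        neg_pool = list(range(gt_st, gt_ed + 1))
    return random.sample(neg_pool, k=min(max_n, len(neg_pool)))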

Missing opt.json file

Hi WonJun,

While running the command bash qd_detr/scripts/inference.sh results/checkpoints/model_best.ckpt 'val' (with {direc} replaced by my path), I get a missing opt.json error:

FileNotFoundError: [Errno 2] No such file or directory: 'results/checkpoints/opt.json'

It seems the test options file is missing?

Best,
Noga
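For context, in the Moment-DETR codebase that QD-DETR builds on, opt.json is simply the parsed training options serialized into the experiment's results directory; the inference script reloads it so the model can be rebuilt with the same settings used for training. Roughly (a sketch of the idea, not the repository's exact code):

import json

def save_training_options(opt, path):
    # The argparse Namespace from training is dumped to JSON so that
    # inference can later rebuild the model with identical settings.
    with open(path, "w") as f:
        json.dump(vars(opt), f, indent=4, sort_keys=True)

So if the checkpoint directory you downloaded does not contain opt.json, the usual fix is to keep the checkpoint inside the results directory produced by training, alongside the opt.json written during that run.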

Use videoonly.ckpt

I read some other issues where you mention that the video-only model was trained with CLIP+SlowFast features. Can you explain how to extract the SlowFast features and aggregate them with the CLIP ones in the run_on_video.py script?

Thanks in advance!

About the loss values

Hello, when reproducing this code, I found that the loss values are somewhat strange. They are generally large and do not seem to converge. In particular, loss_saliency exceeds 4, and loss_label shows an increasing trend over 40 evaluations. Did you encounter this in your work? Are these loss values normal, and why are they on the large side? I attach the eval.log.txt file from my reproduction below:
eval.log.txt

What does the parameter "use_tef" mean?

Hello,
Thanks for the great work. I would like to ask what the parameter "use_tef" represents. Why is "tef" concatenated with model_inputs["video_feat"] when "use_tef" is set to True? The code on line 97 of QD-DETR/qd_detr/start_end_dataset.py is shown below:
if self.use_tef:
    tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l
    tef_ed = tef_st + 1.0 / ctx_l
    tef = torch.stack([tef_st, tef_ed], dim=1)  # (Lv, 2)
    if self.use_video:
        model_inputs["video_feat"] = torch.cat(
            [model_inputs["video_feat"], tef], dim=1)  # (Lv, Dv+2)
    else:
        model_inputs["video_feat"] = tef
best,
Eason
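For context, tef here is the temporal endpoint feature (as in Moment-DETR): each clip is tagged with its normalized start and end position in the video, and these two numbers are concatenated to the clip's visual feature so the model knows where the clip sits in time. A small numeric sketch of what the quoted code produces for ctx_l = 4:

import torch

ctx_l = 4  # number of clips in the video
tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l  # tensor([0.00, 0.25, 0.50, 0.75])
tef_ed = tef_st + 1.0 / ctx_l                 # tensor([0.25, 0.50, 0.75, 1.00])
tef = torch.stack([tef_st, tef_ed], dim=1)    # shape (4, 2): per-clip (start, end)
# With use_video=True, these two columns are concatenated to each clip's visual
# feature, so the feature dimension grows from Dv to Dv + 2.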

Cannot completely reproduce the reported video-only results on QVHighlights with the default configs

Hello, I tried to reproduce the video-only results with the official source code, but got weaker results, as follows:

test_MR-full-R1@0.5: 59.99
test_MR-full-R1@0.7: 41.31
test_MR-full-mAP: 36.60
test_MR-full-mAP@0.5: 60.45
test_MR-full-mAP@0.75: 35.78
test_MR-long-mAP: 44.37
test_MR-middle-mAP: 36.94
test_MR-short-mAP: 7.32
test_HL-min-VeryGood-mAP: 38.56
test_HL-min-Good-mAP: 63.94
test_HL-min-Good-Hit1: 73.28
test_HL-min-Fair-mAP: 74.76
test_HL-min-VeryGood-Hit1: 61.54
test_HL-min-Fair-Hit1: 75.10
val_MR-full-R1@0.5: 61.94
val_MR-full-R1@0.7: 44.06
val_MR-full-mAP: 38.86
val_MR-full-mAP@0.5: 61.13
val_MR-full-mAP@0.75: 39.06
val_MR-long-mAP: 44.46
val_MR-middle-mAP: 41.29
val_MR-short-mAP: 7.31
val_HL-min-VeryGood-mAP: 39.33
val_HL-min-Good-mAP: 63.67
val_HL-min-Good-Hit1: 73.42
val_HL-min-Fair-mAP: 74.61
val_HL-min-VeryGood-Hit1: 62.90
val_HL-min-Fair-Hit1: 75.42

My train.sh script is nearly the same as the one provided:

dset_name=hl
ctx_mode=video_tef
v_feat_types=slowfast_clip
t_feat_type=clip 
results_root=results
exp_id=bs32_baseline

######## data paths
train_path=data/highlight_train_release.jsonl
eval_path=data/highlight_val_release.jsonl
eval_split_name=val

######## setup video+text features
feat_root=./features

# video features
v_feat_dim=0
v_feat_dirs=()
if [[ ${v_feat_types} == *"slowfast"* ]]; then
  v_feat_dirs+=(${feat_root}/slowfast_features)
  (( v_feat_dim += 2304 ))  # double brackets for arithmetic op, no need to use ${v_feat_dim}
fi
if [[ ${v_feat_types} == *"clip"* ]]; then
  v_feat_dirs+=(${feat_root}/clip_features)
  (( v_feat_dim += 512 ))
fi

# text features
if [[ ${t_feat_type} == "clip" ]]; then
  t_feat_dir=${feat_root}/clip_text_features/
  t_feat_dim=512
else
  echo "Wrong arg for t_feat_type."
  exit 1
fi

#### training
bsz=32


CUDA_VISIBLE_DEVICES=7 PYTHONPATH=$PYTHONPATH:. python qd_detr/train.py \
--dset_name ${dset_name} \
--ctx_mode ${ctx_mode} \
--train_path ${train_path} \
--eval_path ${eval_path} \
--eval_split_name ${eval_split_name} \
--v_feat_dirs ${v_feat_dirs[@]} \
--v_feat_dim ${v_feat_dim} \
--t_feat_dir ${t_feat_dir} \
--t_feat_dim ${t_feat_dim} \
--bsz ${bsz} \
--results_root ${results_root} \
--exp_id ${exp_id} \
${@:1}

My GPU is an NVIDIA A100 and the PyTorch version is 1.12. What could be wrong with my reproduction?

The features for Charades-STA

Hi authors, could you provide the SlowFast and CLIP features for the Charades-STA dataset? I would like to train the model on it. Thanks!

Training on Charades

Excuse me, how do I train and test your code on the Charades dataset? There are no related commands or information on GitHub. Thank you.

ASR-pretrained checkpoints

Hi WonJun,

Do you provide the checkpoints for the ASR-pretrained models? I currently lack the resources to do the ASR pretraining myself, so I was hoping to test using your checkpoints.

Best,
Noga

About run_on_video

Hi, and thank you for your work!
I see that you provide the qd_detr_ckpt model in run_on_video, and from the project page I gather that I can use the same command as Moment-DETR to run run.py. Is that correct?
My second question is: how should I train my own model (using your QD-DETR training method) so that the resulting checkpoint can replace the provided model in run_on_video?

A very unreasonable phenomenon

Hi authors! I found a very unreasonable phenomenon: in the test phase, when several encoder layers with default (randomly initialized) parameters are appended after the model's encoder, the saliency accuracy (e.g., HL-min-VeryGood-mAP, Hit1) remains basically unchanged. The model used is the one provided by the author on GitHub.

run_on_video error

In the file run_on_video/model_utils.py, the import statement for MomentDETR is incorrect.
The import statement is from qd_detr.model import build_transformer, build_position_encoding, MomentDETR, but MomentDETR is not present in qd_detr.model. Instead, QDDETR is present in qd_detr.model which should be used instead of MomentDETR.

But when I use QDDETR instead of MomentDETR,
from qd_detr.model import build_transformer, build_position_encoding, QDDETR as MomentDETR

and then run run.py, the following error occurs. How can I fix it?

RuntimeError: Error(s) in loading state_dict for QDDETR:
        Missing key(s) in state_dict: "global_rep_token", "global_rep_pos", "transformer.t2v_encoder.layers.0.self_attn.in_proj_weight", "transformer.t2v_encoder.layers.0.self_attn.in_proj_bias", "transformer.t2v_encoder.layers.0.self_attn.out_proj.weight", "transformer.t2v_encoder.layers.0.self_attn.out_proj.bias", "transformer.t2v_encoder.layers.0.linear1.weight", "transformer.t2v_encoder.layers.0.linear1.bias", "transformer.t2v_encoder.layers.0.linear2.weight", "transformer.t2v_encoder.layers.0.linear2.bias", "transformer.t2v_encoder.layers.0.norm1.weight", "transformer.t2v_encoder.layers.0.norm1.bias", "transformer.t2v_encoder.layers.0.norm2.weight", "transformer.t2v_encoder.layers.0.norm2.bias", "transformer.t2v_encoder.layers.0.activation.weight", "transformer.t2v_encoder.layers.1.self_attn.in_proj_weight", "transformer.t2v_encoder.layers.1.self_attn.in_proj_bias", "transformer.t2v_encoder.layers.1.self_attn.out_proj.weight", "transformer.t2v_encoder.layers.1.self_attn.out_proj.bias", "transformer.t2v_encoder.layers.1.linear1.weight", "transformer.t2v_encoder.layers.1.linear1.bias", "transformer.t2v_encoder.layers.1.linear2.weight", "transformer.t2v_encoder.layers.1.linear2.bias", "transformer.t2v_encoder.layers.1.norm1.weight", "transformer.t2v_encoder.layers.1.norm1.bias", "transformer.t2v_encoder.layers.1.norm2.weight", "transformer.t2v_encoder.layers.1.norm2.bias", "transformer.t2v_encoder.layers.1.activation.weight", "transformer.encoder.layers.0.activation.weight", "transformer.encoder.layers.1.activation.weight", "transformer.decoder.layers.0.sa_qcontent_proj.weight", "transformer.decoder.layers.0.sa_qcontent_proj.bias", "transformer.decoder.layers.0.sa_qpos_proj.weight", "transformer.decoder.layers.0.sa_qpos_proj.bias", "transformer.decoder.layers.0.sa_kcontent_proj.weight", "transformer.decoder.layers.0.sa_kcontent_proj.bias", "transformer.decoder.layers.0.sa_kpos_proj.weight", "transformer.decoder.layers.0.sa_kpos_proj.bias", "transformer.decoder.layers.0.sa_v_proj.weight", "transformer.decoder.layers.0.sa_v_proj.bias", "transformer.decoder.layers.0.ca_qcontent_proj.weight", "transformer.decoder.layers.0.ca_qcontent_proj.bias", "transformer.decoder.layers.0.ca_qpos_proj.weight", "transformer.decoder.layers.0.ca_qpos_proj.bias", "transformer.decoder.layers.0.ca_kcontent_proj.weight", "transformer.decoder.layers.0.ca_kcontent_proj.bias", "transformer.decoder.layers.0.ca_kpos_proj.weight", "transformer.decoder.layers.0.ca_kpos_proj.bias", "transformer.decoder.layers.0.ca_v_proj.weight", "transformer.decoder.layers.0.ca_v_proj.bias", "transformer.decoder.layers.0.ca_qpos_sine_proj.weight", "transformer.decoder.layers.0.ca_qpos_sine_proj.bias", "transformer.decoder.layers.0.cross_attn.out_proj.weight", "transformer.decoder.layers.0.cross_attn.out_proj.bias", "transformer.decoder.layers.0.activation.weight", "transformer.decoder.layers.1.sa_qcontent_proj.weight", "transformer.decoder.layers.1.sa_qcontent_proj.bias", "transformer.decoder.layers.1.sa_qpos_proj.weight", "transformer.decoder.layers.1.sa_qpos_proj.bias", "transformer.decoder.layers.1.sa_kcontent_proj.weight", "transformer.decoder.layers.1.sa_kcontent_proj.bias", "transformer.decoder.layers.1.sa_kpos_proj.weight", "transformer.decoder.layers.1.sa_kpos_proj.bias", "transformer.decoder.layers.1.sa_v_proj.weight", "transformer.decoder.layers.1.sa_v_proj.bias", "transformer.decoder.layers.1.ca_qcontent_proj.weight", "transformer.decoder.layers.1.ca_qcontent_proj.bias", 
"transformer.decoder.layers.1.ca_kcontent_proj.weight", "transformer.decoder.layers.1.ca_kcontent_proj.bias", "transformer.decoder.layers.1.ca_kpos_proj.weight", "transformer.decoder.layers.1.ca_kpos_proj.bias", "transformer.decoder.layers.1.ca_v_proj.weight", "transformer.decoder.layers.1.ca_v_proj.bias", "transformer.decoder.layers.1.ca_qpos_sine_proj.weight", "transformer.decoder.layers.1.ca_qpos_sine_proj.bias", "transformer.decoder.layers.1.cross_attn.out_proj.weight", "transformer.decoder.layers.1.cross_attn.out_proj.bias", "transformer.decoder.layers.1.activation.weight", "transformer.decoder.query_scale.layers.0.weight", "transformer.decoder.query_scale.layers.0.bias", "transformer.decoder.query_scale.layers.1.weight", "transformer.decoder.query_scale.layers.1.bias", "transformer.decoder.ref_point_head.layers.0.weight", "transformer.decoder.ref_point_head.layers.0.bias", "transformer.decoder.ref_point_head.layers.1.weight", "transformer.decoder.ref_point_head.layers.1.bias", "transformer.decoder.bbox_embed.layers.0.weight", "transformer.decoder.bbox_embed.layers.0.bias", "transformer.decoder.bbox_embed.layers.1.weight", "transformer.decoder.bbox_embed.layers.1.bias", "transformer.decoder.bbox_embed.layers.2.weight", "transformer.decoder.bbox_embed.layers.2.bias", "transformer.decoder.ref_anchor_head.layers.0.weight", "transformer.decoder.ref_anchor_head.layers.0.bias", "transformer.decoder.ref_anchor_head.layers.1.weight", "transformer.decoder.ref_anchor_head.layers.1.bias", "saliency_proj1.weight", "saliency_proj1.bias", "saliency_proj2.weight", "saliency_proj2.bias".
        Unexpected key(s) in state_dict: "saliency_proj.weight", "saliency_proj.bias", "transformer.decoder.layers.0.multihead_attn.in_proj_weight", "transformer.decoder.layers.0.multihead_attn.in_proj_bias", "transformer.decoder.layers.0.multihead_attn.out_proj.weight", "transformer.decoder.layers.0.multihead_attn.out_proj.bias", "transformer.decoder.layers.0.self_attn.in_proj_weight", "transformer.decoder.layers.0.self_attn.in_proj_bias", "transformer.decoder.layers.1.multihead_attn.in_proj_weight", "transformer.decoder.layers.1.multihead_attn.in_proj_bias", "transformer.decoder.layers.1.multihead_attn.out_proj.weight", "transformer.decoder.layers.1.multihead_attn.out_proj.bias", "transformer.decoder.layers.1.self_attn.in_proj_weight", "transformer.decoder.layers.1.self_attn.in_proj_bias".
        size mismatch for query_embed.weight: copying a param with shape torch.Size([10, 256]) from checkpoint, the shape in current model is torch.Size([10, 2]).
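One observation: the missing keys all involve transformer.t2v_encoder and the DAB-DETR-style decoder projections that QD-DETR adds, while the unexpected keys (multihead_attn, saliency_proj) look like plain Moment-DETR weights, so the checkpoint being loaded appears to be a Moment-DETR checkpoint rather than a QD-DETR one. A quick, hedged way to check before loading (the "model" key and the path below are assumptions/placeholders):

import torch

ckpt = torch.load("path/to/model_best.ckpt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # assumption: weights are stored under a "model" key
has_t2v = any(k.startswith("transformer.t2v_encoder") for k in state_dict)
print("looks like a QD-DETR checkpoint:", has_t2v)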

What does the parameter 'clip_len' mean?

Hi authors, great work. I now want to train the model on the Charades-STA dataset. After reading the issues in jayleicn/moment_detr#11 and #1, I found different values for 'clip_len': in Moment-DETR the author sets clip_len to 2, while in QD-DETR you set clip_len to 1. I am confused: what does 'clip_len' mean?
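As far as I can tell from the dataset and evaluation code quoted earlier (e.g., mk_gt_scores), clip_len is the duration in seconds of one feature clip, i.e., the stride at which the video features were extracted. It converts between seconds and clip indices: num_clips = duration / clip_len, and a moment given in seconds maps to clip indices by dividing by clip_len. A tiny sketch (the helper name is mine):

def seconds_to_clip_indices(window, clip_len):
    """Map a (start_sec, end_sec) moment to feature-clip indices, as mk_gt_scores does."""
    st_sec, ed_sec = window
    return int(st_sec / clip_len), int(ed_sec / clip_len)

print(seconds_to_clip_indices((10, 24), clip_len=2.0))  # (5, 12): 2-second clips, QVHighlights-style
print(seconds_to_clip_indices((10, 24), clip_len=1.0))  # (10, 24): one clip per second

So the right value depends on the stride of the features you extract for Charades-STA, which is presumably why different issues quote different numbers.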

About feature extraction

1. To use the pretrained model, could you please share the script for SlowFast feature extraction?
2. Are the parameters for extracting CLIP features exactly the same as in Moment-DETR?
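Regarding the CLIP side, a generic frame-level CLIP feature extractor looks roughly like the sketch below; the model variant, frame sampling rate, and preprocessing are assumptions that should be checked against the Moment-DETR feature-extraction setup rather than taken as this repository's exact parameters:

import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # model choice is an assumption

def encode_frames(frame_paths):
    """Encode a list of sampled frame images into one CLIP feature per frame."""
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(images)  # (num_frames, 512) for ViT-B/32
    return feats.cpu()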

About ablation study

Thanks for your work. I have some questions about the ablation study.

  1. In Table 4 of the paper (the ablation study on the QVHighlights val split), is row (a) Moment-DETR? If so, is row (b) Moment-DETR with only the cross-attentive transformer encoder added?
  2. What should I do if I want to run the same ablation experiment as row (b), i.e., using only the CATE?

Test Result

Hello, after I generate the hl_test_submission.jsonl file, how can I get my model's accuracy?
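For context (not an official answer): the QVHighlights test-set annotations are withheld, so hl_test_submission.jsonl can only be scored by uploading it to the CodaLab server; only val predictions can be scored locally. Assuming QD-DETR keeps Moment-DETR's standalone_eval module (the module path, function name, and helper below are assumptions carried over from that codebase, and the paths are placeholders), local val scoring looks roughly like:

from utils.basic_utils import load_jsonl           # assumed helper from the Moment-DETR codebase
from standalone_eval.eval import eval_submission   # assumed module carried over from Moment-DETR

submission = load_jsonl("results/<exp_dir>/best_hl_val_preds.jsonl")  # placeholder: val predictions from training
ground_truth = load_jsonl("data/highlight_val_release.jsonl")
metrics = eval_submission(submission, ground_truth)
print(metrics["brief"])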

Eval

Excuse me, CodaLab only allows 5 uploads; how can I evaluate the results of the V+A model?

TVSUM Result

Excuse me, I found that there are five random seed values in the TVSum code. After reproducing your code, I found that the five results differ considerably. May I ask how the results reported in the paper were selected?

Missing umt_clip_text_features and umt_pann_features

Hi WonJun,

The qd_detr/scripts/train_audio.sh script makes use of features/umt_clip_text_features and features/umt_pann_features. However, these are missing in the moment_detr_features.tar.gz file. Could you upload these features as well, please?

Best,
Noga
