
wjun0830 / qd-detr

170 stars · 4 watchers · 13 forks · 1.26 GB

Official pytorch repository for "QD-DETR : Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023 Paper)

Home Page: https://arxiv.org/abs/2303.13874

License: Other

Python 97.94% Shell 2.06%
computer-vision moment-retrieval multi-modal video-highlight-detection video-retrieval video-summarization text-video-retrieval deep-learning detection-transformer

qd-detr's People

Contributors

wjun0830


qd-detr's Issues

About training on Charades.

Excuse me, I couldn't reproduce the results reported in the paper on the Charades dataset, even after setting the parameters according to issue #1. The C3D features used were obtained from https://drive.google.com/file/d/1CcMwae55Tuve_Ksrp5kONycyR1bVcX8D/view. Furthermore, for the SlowFast & CLIP features, I modified the code as follows:

# start_end_dataset.py
if self.dset_name == 'charades':
    model_inputs["saliency_pos_labels"], model_inputs["saliency_neg_labels"], model_inputs["saliency_all_labels"] = \
        self.get_saliency_labels_sub_as_query(meta["relevant_windows"][0], ctx_l)  # only one gt

and in eval.py, I modified the "mk_gt_scores" function:

def mk_gt_scores(gt_data, clip_length=1):
    """gt_data, dict, """
    # print("gt_data[duration]",gt_data["duration"])
    num_clips = int(gt_data["duration"] / clip_length)
    saliency_scores_full_video = np.zeros((num_clips, 3))
    relevant_clip_ids = np.arange(int(gt_data["relevant_windows"][0][0]), int(gt_data["relevant_windows"][0][1]))
    # FIXME
    saliency_scores_relevant_clips = np.ones((relevant_clip_ids.shape[0],3))  # (#relevant_clip_ids, 3)
    saliency_scores_full_video[relevant_clip_ids] = saliency_scores_relevant_clips
    return saliency_scores_full_video  # (#clips_in_video, 3)  the scores are in range [0, 4]

Actually, I found that if I don't modify the "mk_gt_scores" code and simply comment out the line pred_saliency_scores=saliency_scores[idx] in inference.py, it produces the same result.

cur_query_pred = dict(
    qid=meta["qid"],
    query=meta["query"],
    vid=meta["vid"],
    pred_relevant_windows=cur_ranked_preds,
    # pred_saliency_scores=saliency_scores[idx]
)

So, could you help me reproduce the results reported in the paper on the Charades dataset? Thanks.

The I3D features for Charades-STA

Hi, may I ask whether the Charades-STA experiments use the I3D features provided by VSLNet? If so, what is the value of your 'clip_len' parameter? If not, could you please provide the code for extracting the video features?

The implementation of rank-aware contrastive loss

Thank you for your open-source code.
I have a question about the code for the rank-aware contrastive loss mentioned in Section 3.4 of the paper:

This is the rank-aware contrastive loss formula listed in the paper:
[formula image from the paper]

This is the rank-aware contrastive loss computation in qd_detr/model.py:
# softmax
exp_logits = torch.exp(logits)
log_prob = logits - torch.log(exp_logits.sum(1, keepdim=True) + 1e-6)
mean_log_prob_pos = (pos_mask * log_prob * vid_token_mask).sum(1) / (pos_mask.sum(1) + 1e-6)
loss = - mean_log_prob_pos * batch_drop_mask
loss_rank_contrastive = loss_rank_contrastive + loss.mean()

I don't think the code implementation matches the formula; the bolded part of the code in particular confuses me.
If I have misunderstood something, please let me know. Thank you for your time.
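For reference, the quoted lines compute a log-softmax over the saliency logits of all video tokens and then average the resulting log-probabilities over the positive clips. Below is a minimal, self-contained sketch of that computation (the shapes and mask semantics are assumptions, and this is not the repository's exact code), which may make it easier to line the code up against the paper's equation:

import torch

def rank_contrastive_term_sketch(logits, pos_mask, vid_token_mask):
    """logits: (B, L) saliency scores; pos_mask / vid_token_mask: (B, L) with entries in {0, 1}."""
    # denominator of the softmax: sum of exp(logit) over every token
    exp_logits = torch.exp(logits)
    # log p_i = s_i - log(sum_j exp(s_j)); note that padded tokens are only
    # masked out in the numerator below, not in this denominator
    log_prob = logits - torch.log(exp_logits.sum(1, keepdim=True) + 1e-6)
    # average log-probability over the positive (and valid) clips of each sample
    mean_log_prob_pos = (pos_mask * log_prob * vid_token_mask).sum(1) / (pos_mask.sum(1) + 1e-6)
    return -mean_log_prob_pos.mean()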

Confusion about the code

Hello, I'm studying the code in your repository and I have a question about it. What does the variable "ctx_mode" (default "video_tef") indicate?
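For what it is worth, in the Moment-DETR-style dataset code that QD-DETR builds on, this flag is simply parsed into two booleans, so "video_tef" means "use the video features and also append temporal endpoint features (TEF)". Roughly (a sketch, not the repository's exact code):

ctx_mode = "video_tef"
use_video = "video" in ctx_mode  # feed the (SlowFast + CLIP) video features
use_tef = "tef" in ctx_mode      # append temporal endpoint features; see the "use_tef" issue below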

Inference

Hello, can I run the code in the run_on_video folder you provided on my own video, like Moment-DETR? If so, could you provide the best checkpoint from your training and validation process? Thanks!

About the description of the results file.

Hello,
Thank you for your interesting work. I have some questions about your code and paper. Does the result in Table 1 of the paper correspond to results/video_checkpoint_best_hl_val_preds_metrics.json in the "results" folder?

Best,
Eden

Question Regarding Feature Extraction Discrepancy Between Training & Inference

Hello, congratulations and thank you for sharing this awesome project!

I am cross posting a question from the Moment-DETR repo: jayleicn/moment_detr#26

In the paper and training code, it seems that both SlowFast and CLIP video features are used. But during inference time, it looks like only CLIP features are being used (based on the run.py file).

Am I understanding this correctly? If yes, what is the cause for this discrepancy?

Training on Charades-STA dataset with VGG backbone

Sorry for bothering you. When I train the model on the Charades-STA dataset with the VGG backbone, I follow one of the issues and set clip_len to 0.1666. However, this easily leads to the situation where neg_pool is empty and no negative index can be sampled.


So, I wonder whether there is any solution to this. Thanks!
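For context, the empty negative pool usually arises because, with a very small clip_len, the rescaled ground-truth window can cover the entire clip range, leaving no clips outside it from which to sample saliency negatives. One possible workaround, sketched below under the assumption that the negative pool is built as "all clip indices outside the GT window" (the helper name is hypothetical), is to fall back gracefully when the pool is empty:

import random

def sample_saliency_negatives_sketch(gt_st, gt_ed, ctx_l, max_n=2):
    """Hypothetical helper: negatives are clips outside the GT window [gt_st, gt_ed]."""
    neg_pool = list(range(0, gt_st)) + list(range(gt_ed + 1, ctx_l))
    if len(neg_pool) == 0:
        # The GT window spans the whole (rescaled) video: either skip the
        # saliency loss for this sample or, as done here, reuse positive clips.
        neg_pool = list(range(gt_st, gt_ed + 1))
    return random.sample(neg_pool, k=min(max_n, len(neg_pool)))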

Missing opt.json file

Hi WonJun,

While running the command bash qd_detr/scripts/inference.sh results/checkpoints/model_best.ckpt 'val' (with {direc} replaced by my path), I get a missing opt.json error:

FileNotFoundError: [Errno 2] No such file or directory: 'results/checkpoints/opt.json'

It seems the test options file is missing?

Best,
Noga
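For context, in the Moment-DETR codebase that QD-DETR builds on, opt.json is simply the parsed training options serialized into the experiment's results directory; the inference script reloads it so the model can be rebuilt with the same settings used for training. Roughly (a sketch of the idea, not the repository's exact code):

import json

def save_training_options(opt, path):
    # The argparse Namespace from training is dumped to JSON so that
    # inference can later rebuild the model with identical settings.
    with open(path, "w") as f:
        json.dump(vars(opt), f, indent=4, sort_keys=True)

So if the checkpoint directory you downloaded does not contain opt.json, the usual fix is to keep the checkpoint inside the results directory produced by training, alongside the opt.json written during that run.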

Use videoonly.ckpt

I read some other issues where you mention that the video-only model was trained with CLIP+SlowFast features. Can you explain how to extract the SlowFast features and aggregate them with the CLIP ones in the run_on_video.py script?

Thanks in advance!

About the loss values

Hello, when reproducing this code, I found that the loss values are somewhat strange. They are generally large and do not seem to converge. In particular, loss_saliency exceeds 4, and loss_label shows an increasing trend over 40 evaluations. Did you encounter this in your work? Are these loss values normal, and why are they on the large side? I attach the eval.log.txt file from my reproduction below:
eval.log.txt

What does the parameter "use_tef" mean?

Hello,
Thanks for the great work. I would like to ask what the parameter "use_tef" represents. Why is "tef" concatenated with model_inputs["video_feat"] when "use_tef" is set to True? The code on line 97 of QD-DETR/qd_detr/start_end_dataset.py is shown below:
if self.use_tef:
    tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l
    tef_ed = tef_st + 1.0 / ctx_l
    tef = torch.stack([tef_st, tef_ed], dim=1)  # (Lv, 2)
    if self.use_video:
        model_inputs["video_feat"] = torch.cat(
            [model_inputs["video_feat"], tef], dim=1)  # (Lv, Dv+2)
    else:
        model_inputs["video_feat"] = tef
best,
Eason
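For context, tef here is the temporal endpoint feature (as in Moment-DETR): each clip is tagged with its normalized start and end position in the video, and these two numbers are concatenated to the clip's visual feature so the model knows where the clip sits in time. A small numeric sketch of what the quoted code produces for ctx_l = 4:

import torch

ctx_l = 4  # number of clips in the video
tef_st = torch.arange(0, ctx_l, 1.0) / ctx_l  # tensor([0.00, 0.25, 0.50, 0.75])
tef_ed = tef_st + 1.0 / ctx_l                 # tensor([0.25, 0.50, 0.75, 1.00])
tef = torch.stack([tef_st, tef_ed], dim=1)    # shape (4, 2): per-clip (start, end)
# With use_video=True, these two columns are concatenated to each clip's visual
# feature, so the feature dimension grows from Dv to Dv + 2.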

Cannot completely reproduce the reported video-only results on QVHighlights with the default configs

Hello, I tried to reproduce the video-only results with the official source code, but got weaker results, as follows:

test_MR-full-R1@0.5: 59.99
test_MR-full-R1@0.7: 41.31
test_MR-full-mAP: 36.60
test_MR-full-mAP@0.5: 60.45
test_MR-full-mAP@0.75: 35.78
test_MR-long-mAP: 44.37
test_MR-middle-mAP: 36.94
test_MR-short-mAP: 7.32
test_HL-min-VeryGood-mAP: 38.56
test_HL-min-Good-mAP: 63.94
test_HL-min-Good-Hit1: 73.28
test_HL-min-Fair-mAP: 74.76
test_HL-min-VeryGood-Hit1: 61.54
test_HL-min-Fair-Hit1: 75.10
val_MR-full-R1@0.5: 61.94
val_MR-full-R1@0.7: 44.06
val_MR-full-mAP: 38.86
val_MR-full-mAP@0.5: 61.13
val_MR-full-mAP@0.75: 39.06
val_MR-long-mAP: 44.46
val_MR-middle-mAP: 41.29
val_MR-short-mAP: 7.31
val_HL-min-VeryGood-mAP: 39.33
val_HL-min-Good-mAP: 63.67
val_HL-min-Good-Hit1: 73.42
val_HL-min-Fair-mAP: 74.61
val_HL-min-VeryGood-Hit1: 62.90
val_HL-min-Fair-Hit1: 75.42

My train.sh script is nearly the same as the one provided:

dset_name=hl
ctx_mode=video_tef
v_feat_types=slowfast_clip
t_feat_type=clip 
results_root=results
exp_id=bs32_baseline

######## data paths
train_path=data/highlight_train_release.jsonl
eval_path=data/highlight_val_release.jsonl
eval_split_name=val

######## setup video+text features
feat_root=./features

# video features
v_feat_dim=0
v_feat_dirs=()
if [[ ${v_feat_types} == *"slowfast"* ]]; then
  v_feat_dirs+=(${feat_root}/slowfast_features)
  (( v_feat_dim += 2304 ))  # double brackets for arithmetic op, no need to use ${v_feat_dim}
fi
if [[ ${v_feat_types} == *"clip"* ]]; then
  v_feat_dirs+=(${feat_root}/clip_features)
  (( v_feat_dim += 512 ))
fi

# text features
if [[ ${t_feat_type} == "clip" ]]; then
  t_feat_dir=${feat_root}/clip_text_features/
  t_feat_dim=512
else
  echo "Wrong arg for t_feat_type."
  exit 1
fi

#### training
bsz=32


CUDA_VISIBLE_DEVICES=7 PYTHONPATH=$PYTHONPATH:. python qd_detr/train.py \
--dset_name ${dset_name} \
--ctx_mode ${ctx_mode} \
--train_path ${train_path} \
--eval_path ${eval_path} \
--eval_split_name ${eval_split_name} \
--v_feat_dirs ${v_feat_dirs[@]} \
--v_feat_dim ${v_feat_dim} \
--t_feat_dir ${t_feat_dir} \
--t_feat_dim ${t_feat_dim} \
--bsz ${bsz} \
--results_root ${results_root} \
--exp_id ${exp_id} \
${@:1}

My GPU is an NVIDIA A100 and the PyTorch version is 1.12. What could be wrong with my reproduction?

The features for Charades-STA

Hi authors, could you provide the SlowFast and CLIP features for the Charades-STA dataset? I would like to train the model on it. Thanks!

Training on Charades

Excuse me, how do I train and test your code on the Charades dataset? There are no related commands or information on GitHub. Thank you.

ASR-pretrained checkpoints

Hi WonJun,

Do you provide the checkpoints for the ASR-pretrained models? I currently lack the resources to do the ASR pretraining myself, so I was hoping to test using your checkpoints.

Best,
Noga

About run_on_video

Hi, and thank you for your work!
I see that you provide the qd_detr_ckpt model in run_on_video, and from the project page I gather that I can use the same command as Moment-DETR to run run.py. Is that correct?
My second question is: how should I train my own model (using your QD-DETR training method) so that the resulting checkpoint can replace the provided model in run_on_video?

A very unreasonable phenomenon

Hi authors! I found a very unreasonable phenomenon: in the test phase, when several encoder layers with default (randomly initialized) parameters are appended after the model's encoder, the saliency accuracy (e.g., HL-min-VeryGood-mAP, Hit1) remains basically unchanged. The model used is the one provided by the author on GitHub.

run_on_video error

In the file run_on_video/model_utils.py, the import statement for MomentDETR is incorrect.
The import statement is from qd_detr.model import build_transformer, build_position_encoding, MomentDETR, but MomentDETR is not present in qd_detr.model. Instead, QDDETR is present in qd_detr.model which should be used instead of MomentDETR.

But when I use QDDETR instead of MomentDETR,
from qd_detr.model import build_transformer, build_position_encoding, QDDETR as MomentDETR

and then run run.py, the following error occurs. How can I fix it?

RuntimeError: Error(s) in loading state_dict for QDDETR:
        Missing key(s) in state_dict: "global_rep_token", "global_rep_pos", "transformer.t2v_encoder.layers.0.self_attn.in_proj_weight", "transformer.t2v_encoder.layers.0.self_attn.in_proj_bias", "transformer.t2v_encoder.layers.0.self_attn.out_proj.weight", "transformer.t2v_encoder.layers.0.self_attn.out_proj.bias", "transformer.t2v_encoder.layers.0.linear1.weight", "transformer.t2v_encoder.layers.0.linear1.bias", "transformer.t2v_encoder.layers.0.linear2.weight", "transformer.t2v_encoder.layers.0.linear2.bias", "transformer.t2v_encoder.layers.0.norm1.weight", "transformer.t2v_encoder.layers.0.norm1.bias", "transformer.t2v_encoder.layers.0.norm2.weight", "transformer.t2v_encoder.layers.0.norm2.bias", "transformer.t2v_encoder.layers.0.activation.weight", "transformer.t2v_encoder.layers.1.self_attn.in_proj_weight", "transformer.t2v_encoder.layers.1.self_attn.in_proj_bias", "transformer.t2v_encoder.layers.1.self_attn.out_proj.weight", "transformer.t2v_encoder.layers.1.self_attn.out_proj.bias", "transformer.t2v_encoder.layers.1.linear1.weight", "transformer.t2v_encoder.layers.1.linear1.bias", "transformer.t2v_encoder.layers.1.linear2.weight", "transformer.t2v_encoder.layers.1.linear2.bias", "transformer.t2v_encoder.layers.1.norm1.weight", "transformer.t2v_encoder.layers.1.norm1.bias", "transformer.t2v_encoder.layers.1.norm2.weight", "transformer.t2v_encoder.layers.1.norm2.bias", "transformer.t2v_encoder.layers.1.activation.weight", "transformer.encoder.layers.0.activation.weight", "transformer.encoder.layers.1.activation.weight", "transformer.decoder.layers.0.sa_qcontent_proj.weight", "transformer.decoder.layers.0.sa_qcontent_proj.bias", "transformer.decoder.layers.0.sa_qpos_proj.weight", "transformer.decoder.layers.0.sa_qpos_proj.bias", "transformer.decoder.layers.0.sa_kcontent_proj.weight", "transformer.decoder.layers.0.sa_kcontent_proj.bias", "transformer.decoder.layers.0.sa_kpos_proj.weight", "transformer.decoder.layers.0.sa_kpos_proj.bias", "transformer.decoder.layers.0.sa_v_proj.weight", "transformer.decoder.layers.0.sa_v_proj.bias", "transformer.decoder.layers.0.ca_qcontent_proj.weight", "transformer.decoder.layers.0.ca_qcontent_proj.bias", "transformer.decoder.layers.0.ca_qpos_proj.weight", "transformer.decoder.layers.0.ca_qpos_proj.bias", "transformer.decoder.layers.0.ca_kcontent_proj.weight", "transformer.decoder.layers.0.ca_kcontent_proj.bias", "transformer.decoder.layers.0.ca_kpos_proj.weight", "transformer.decoder.layers.0.ca_kpos_proj.bias", "transformer.decoder.layers.0.ca_v_proj.weight", "transformer.decoder.layers.0.ca_v_proj.bias", "transformer.decoder.layers.0.ca_qpos_sine_proj.weight", "transformer.decoder.layers.0.ca_qpos_sine_proj.bias", "transformer.decoder.layers.0.cross_attn.out_proj.weight", "transformer.decoder.layers.0.cross_attn.out_proj.bias", "transformer.decoder.layers.0.activation.weight", "transformer.decoder.layers.1.sa_qcontent_proj.weight", "transformer.decoder.layers.1.sa_qcontent_proj.bias", "transformer.decoder.layers.1.sa_qpos_proj.weight", "transformer.decoder.layers.1.sa_qpos_proj.bias", "transformer.decoder.layers.1.sa_kcontent_proj.weight", "transformer.decoder.layers.1.sa_kcontent_proj.bias", "transformer.decoder.layers.1.sa_kpos_proj.weight", "transformer.decoder.layers.1.sa_kpos_proj.bias", "transformer.decoder.layers.1.sa_v_proj.weight", "transformer.decoder.layers.1.sa_v_proj.bias", "transformer.decoder.layers.1.ca_qcontent_proj.weight", "transformer.decoder.layers.1.ca_qcontent_proj.bias", 
"transformer.decoder.layers.1.ca_kcontent_proj.weight", "transformer.decoder.layers.1.ca_kcontent_proj.bias", "transformer.decoder.layers.1.ca_kpos_proj.weight", "transformer.decoder.layers.1.ca_kpos_proj.bias", "transformer.decoder.layers.1.ca_v_proj.weight", "transformer.decoder.layers.1.ca_v_proj.bias", "transformer.decoder.layers.1.ca_qpos_sine_proj.weight", "transformer.decoder.layers.1.ca_qpos_sine_proj.bias", "transformer.decoder.layers.1.cross_attn.out_proj.weight", "transformer.decoder.layers.1.cross_attn.out_proj.bias", "transformer.decoder.layers.1.activation.weight", "transformer.decoder.query_scale.layers.0.weight", "transformer.decoder.query_scale.layers.0.bias", "transformer.decoder.query_scale.layers.1.weight", "transformer.decoder.query_scale.layers.1.bias", "transformer.decoder.ref_point_head.layers.0.weight", "transformer.decoder.ref_point_head.layers.0.bias", "transformer.decoder.ref_point_head.layers.1.weight", "transformer.decoder.ref_point_head.layers.1.bias", "transformer.decoder.bbox_embed.layers.0.weight", "transformer.decoder.bbox_embed.layers.0.bias", "transformer.decoder.bbox_embed.layers.1.weight", "transformer.decoder.bbox_embed.layers.1.bias", "transformer.decoder.bbox_embed.layers.2.weight", "transformer.decoder.bbox_embed.layers.2.bias", "transformer.decoder.ref_anchor_head.layers.0.weight", "transformer.decoder.ref_anchor_head.layers.0.bias", "transformer.decoder.ref_anchor_head.layers.1.weight", "transformer.decoder.ref_anchor_head.layers.1.bias", "saliency_proj1.weight", "saliency_proj1.bias", "saliency_proj2.weight", "saliency_proj2.bias".
        Unexpected key(s) in state_dict: "saliency_proj.weight", "saliency_proj.bias", "transformer.decoder.layers.0.multihead_attn.in_proj_weight", "transformer.decoder.layers.0.multihead_attn.in_proj_bias", "transformer.decoder.layers.0.multihead_attn.out_proj.weight", "transformer.decoder.layers.0.multihead_attn.out_proj.bias", "transformer.decoder.layers.0.self_attn.in_proj_weight", "transformer.decoder.layers.0.self_attn.in_proj_bias", "transformer.decoder.layers.1.multihead_attn.in_proj_weight", "transformer.decoder.layers.1.multihead_attn.in_proj_bias", "transformer.decoder.layers.1.multihead_attn.out_proj.weight", "transformer.decoder.layers.1.multihead_attn.out_proj.bias", "transformer.decoder.layers.1.self_attn.in_proj_weight", "transformer.decoder.layers.1.self_attn.in_proj_bias".
        size mismatch for query_embed.weight: copying a param with shape torch.Size([10, 256]) from checkpoint, the shape in current model is torch.Size([10, 2]).
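One observation: the missing keys all involve transformer.t2v_encoder and the DAB-DETR-style decoder projections that QD-DETR adds, while the unexpected keys (multihead_attn, saliency_proj) look like plain Moment-DETR weights, so the checkpoint being loaded appears to be a Moment-DETR checkpoint rather than a QD-DETR one. A quick, hedged way to check before loading (the "model" key and the path below are assumptions/placeholders):

import torch

ckpt = torch.load("path/to/model_best.ckpt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # assumption: weights are stored under a "model" key
has_t2v = any(k.startswith("transformer.t2v_encoder") for k in state_dict)
print("looks like a QD-DETR checkpoint:", has_t2v)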

What does the parameter 'clip_len' mean?

Hi authors, great work. I now want to train the model on the Charades-STA dataset. After reading the issues in jayleicn/moment_detr#11 and #1, I found different values for 'clip_len': in Moment-DETR the author sets clip_len to 2, while in QD-DETR you set clip_len to 1. I am confused: what does 'clip_len' mean?
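As far as I can tell from the dataset and evaluation code quoted earlier (e.g., mk_gt_scores), clip_len is the duration in seconds of one feature clip, i.e., the stride at which the video features were extracted. It converts between seconds and clip indices: num_clips = duration / clip_len, and a moment given in seconds maps to clip indices by dividing by clip_len. A tiny sketch (the helper name is mine):

def seconds_to_clip_indices(window, clip_len):
    """Map a (start_sec, end_sec) moment to feature-clip indices, as mk_gt_scores does."""
    st_sec, ed_sec = window
    return int(st_sec / clip_len), int(ed_sec / clip_len)

print(seconds_to_clip_indices((10, 24), clip_len=2.0))  # (5, 12): 2-second clips, QVHighlights-style
print(seconds_to_clip_indices((10, 24), clip_len=1.0))  # (10, 24): one clip per second

So the right value depends on the stride of the features you extract for Charades-STA, which is presumably why different issues quote different numbers.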

About feature extraction

1. To use the pretrained model, could you please share the script for SlowFast feature extraction?
2. Are the parameters for extracting CLIP features exactly the same as in Moment-DETR?
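Regarding the CLIP side, a generic frame-level CLIP feature extractor looks roughly like the sketch below; the model variant, frame sampling rate, and preprocessing are assumptions that should be checked against the Moment-DETR feature-extraction setup rather than taken as this repository's exact parameters:

import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # model choice is an assumption

def encode_frames(frame_paths):
    """Encode a list of sampled frame images into one CLIP feature per frame."""
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(images)  # (num_frames, 512) for ViT-B/32
    return feats.cpu()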

About ablation study

Thanks for your work. I have some questions about the ablation study.

  1. In Table 4 of the paper (the ablation study on the QVHighlights val split), is row (a) Moment-DETR? If so, is row (b) Moment-DETR with only the cross-attentive transformer encoder added?
  2. What should I do if I want to run the same ablation experiment as row (b), i.e., using only the CATE?

Test Result

Hello, after I generate the hl_test_submission.jsonl file, how can I get my model's accuracy?
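For context (not an official answer): the QVHighlights test-set annotations are withheld, so hl_test_submission.jsonl can only be scored by uploading it to the CodaLab server; only val predictions can be scored locally. Assuming QD-DETR keeps Moment-DETR's standalone_eval module (the module path, function name, and helper below are assumptions carried over from that codebase, and the paths are placeholders), local val scoring looks roughly like:

from utils.basic_utils import load_jsonl           # assumed helper from the Moment-DETR codebase
from standalone_eval.eval import eval_submission   # assumed module carried over from Moment-DETR

submission = load_jsonl("results/<exp_dir>/best_hl_val_preds.jsonl")  # placeholder: val predictions from training
ground_truth = load_jsonl("data/highlight_val_release.jsonl")
metrics = eval_submission(submission, ground_truth)
print(metrics["brief"])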

Eval

Excuse me, CodaLab only allows 5 uploads; how can I evaluate the results of the V+A model?

TVSUM Result

Excuse me, I found that there are five random seed values in the TVSum code. After reproducing your code, I found that the five results differ considerably. May I ask how the results reported in the paper were selected?

Missing umt_clip_text_features and umt_pann_features

Hi WonJun,

The qd_detr/scripts/train_audio.sh script makes use of features/umt_clip_text_features and features/umt_pann_features. However, these are missing in the moment_detr_features.tar.gz file. Could you upload these features as well, please?

Best,
Noga
