
mmd-mp's Introduction

Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy

Official PyTorch implementation of the ICLR 2024 paper:

Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy

Shuhai Zhang, Yiliao Song, Jiahao Yang, Yuanqing Li, Bo Han, Mingkui Tan.

Abstract: Large language models (LLMs) such as ChatGPT have exhibited remarkable performance in generating human-like texts. However, machine-generated texts (MGTs) may carry critical risks, such as plagiarism issues, misleading information, or hallucination issues. Therefore, it is very urgent and important to detect MGTs in many situations. Unfortunately, it is challenging to distinguish MGTs and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of LLMs. In this paper, we seek to exploit maximum mean discrepancy (MMD) to address this issue in the sense that MMD can well identify distributional discrepancies. However, directly training a detector with MMD using diverse MGTs will incur a significantly increased variance of MMD since MGTs may contain multiple text populations due to various LLMs. This will severely impair MMD’s ability to measure the difference between two samples. To tackle this, we propose a novel multi-population aware optimization method for MMD called MMD-MP, which can avoid variance increases and thus improve the stability to measure the distributional discrepancy. Relying on MMD-MP, we develop two methods for paragraph-based and sentence-based detection, respectively. Extensive experiments on various LLMs, e.g., GPT2 and ChatGPT, show superior detection performance of our MMD-MP.
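Since the method builds on the MMD two-sample statistic, the following is a minimal PyTorch sketch of the standard unbiased squared-MMD estimator with a Gaussian kernel, for orientation only. It is not the repository's MMD-MP implementation, which changes how the kernel-mean terms are optimized to handle multiple MGT populations; the kernel parametrization below is an assumption.

import torch

def gaussian_kernel(a, b, sigma):
    # Pairwise squared Euclidean distances -> Gaussian kernel matrix.
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    # Unbiased U-statistic estimate of MMD^2 between samples
    # x of shape (n, d) and y of shape (m, d); requires n, m >= 2.
    n, m = x.shape[0], y.shape[0]
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    # Drop diagonal (self-similarity) terms for unbiasedness.
    term_xx = (kxx.sum() - kxx.diagonal().sum()) / (n * (n - 1))
    term_yy = (kyy.sum() - kyy.diagonal().sum()) / (m * (m - 1))
    return term_xx + term_yy - 2 * kxy.mean()

In MMD-MP, the term over pairs of machine-generated texts (term_yy above) is what drives the variance increase under multiple text populations; the --is_yy_zero flag in the commands below appears related, but that mapping is an inference from the flag name, not documented here.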

Requirements

  • An NVIDIA RTX graphics card with 12 GB of memory.
  • Python 3.7
  • PyTorch 1.13.1

Data and pre-trained models

For the dataset, we use HC3, which can be obtained via the download link. For the pre-trained language models, first download them from the following links:

After downloading, fill in model_path_dit in the run file.
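A hypothetical illustration of what filling in model_path_dit might look like; the keys must match the model names passed on the command line, and the paths are placeholders for wherever you stored the downloads:

model_path_dit = {
    # Placeholder local paths to the downloaded pre-trained models.
    'roberta-base-openai-detector': '/path/to/roberta-base-openai-detector',
    'gpt2': '/path/to/gpt2',
}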

Environment of MMD-MP

Create a virtual environment and install the libraries needed for training and evaluation:

conda env create -f detectGPT.yml
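Then activate the environment before running any script. The name detectGPT is assumed from the environment file name and the conda paths visible in the issue traceback further below:

conda activate detectGPT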

Run experiments on HC3

Training MMD-MP.

  • Select the best model using best_power:
CUDA_VISIBLE_DEVICES=0 \
python run_meta_mmd_trans.py \
--id 10001 \
--sigma0 55 \
--lr 0.00005 \
--no_meta_flag \
--n_samples 3900 \
--target_senten_num 3000 \
--val_num 50 \
--sigma 30 \
--max_length 100 \
--trial_num 3 \
--num_hidden_layers 1 \
--target_datasets HC3 \
--text_generated_model_name chatGPT \
--base_model_name roberta-base-openai-detector \
--skip_baselines \
--mask_flag \
--transformer_flag \
--meta_test_flag \
--epochs 100 \
--two_sample_test \
--is_yy_zero
  • Select the best model using best_auroc:
CUDA_VISIBLE_DEVICES=1 \
python run_meta_mmd_trans_auroc.py \
--id 10002 \
--sigma0 40 \
--lr 0.00005 \
--no_meta_flag \
--n_samples 3900 \
--target_senten_num 3000 \
--val_num 50 \
--sigma 30 \
--max_length 100 \
--trial_num 3 \
--num_hidden_layers 1 \
--target_datasets HC3 \
--text_generated_model_name chatGPT \
--base_model_name roberta-base-openai-detector \
--skip_baselines \
--mask_flag \
--transformer_flag \
--meta_test_flag \
--epochs 100 \
--two_sample_test \
--is_yy_zero

Testing MMD-MP.

  • Add the --test_flag command-line argument to enable testing; it evaluates the checkpoint corresponding to the specified id:
CUDA_VISIBLE_DEVICES=0 \
python run_meta_mmd_trans.py \
--test_flag \
--id 10001 \
--sigma0 55 \
--lr 0.00005 \
--no_meta_flag \
--n_samples 3900 \
--target_senten_num 3000 \
--val_num 50 \
--sigma 30 \
--max_length 100 \
--trial_num 3 \
--num_hidden_layers 1 \
--target_datasets HC3 \
--text_generated_model_name chatGPT \
--base_model_name roberta-base-openai-detector \
--skip_baselines \
--mask_flag \
--transformer_flag \
--meta_test_flag \
--epochs 100 \
--two_sample_test

Citation

@inproceedings{zhangs2024MMDMP,
  title={Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy},
  author={Zhang, Shuhai and Song, Yiliao and Yang, Jiahao and Li, Yuanqing and Han, Bo and Tan, Mingkui},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}


mmd-mp's Issues

OSError: No such device (os error 19) When Loading Model with Transformers

Token indices sequence length is longer than the specified maximum sequence length for this model (749 > 512). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "run_meta_mmd_trans.py", line 925, in <module>
    base_model, base_tokenizer = load_base_model_and_tokenizer(args.base_model_name)
  File "run_meta_mmd_trans.py", line 531, in load_base_model_and_tokenizer
    base_model = transformers.AutoModelForCausalLM.from_pretrained(model_path_dit[name])
  File "/root/miniconda3/envs/detectGPT/lib/python3.7/site-packages/transformers/models/auto/auto_factory.py", line 485, in from_pretrained
    pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
  File "/root/miniconda3/envs/detectGPT/lib/python3.7/site-packages/transformers/modeling_utils.py", line 2604, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "/root/miniconda3/envs/detectGPT/lib/python3.7/site-packages/transformers/modeling_utils.py", line 450, in load_state_dict
    with safe_open(checkpoint_file, framework="pt") as f:
OSError: No such device (os error 19)

The default of the "text_generated_model_name" argument

The original parser's default for the "text_generated_model_name" argument is the string "gpt2". When the script iterates over it, each character is treated as a model name, raising errors such as key 'g' does not exist. I assume the default should be a list instead, e.g. default=['gpt2']? A sketch of the suggested fix follows.
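A minimal sketch of the suggested fix, assuming the argument is declared with argparse; the exact declaration in run_meta_mmd_trans.py may differ:

import argparse

parser = argparse.ArgumentParser()
# A bare-string default iterates character by character when looped over:
# parser.add_argument('--text_generated_model_name', default='gpt2')
# Making the default a one-element list (and accepting several names)
# lets the traversal yield whole model names instead:
parser.add_argument('--text_generated_model_name', nargs='+',
                    default=['gpt2'])
args = parser.parse_args()
for name in args.text_generated_model_name:
    print(name)  # prints 'gpt2', not 'g', 'p', 't', '2'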
