
genius's Introduction

Hi there 👋

This is Biyang Guo!

Homepage: Beyond@SimpleAI

I'm a PhD student at SUFE. My research interests lie in NLP (Natural Language Processing) and DCAI (Data-Centric AI).

Here are some research projects that I lead:

Project | Paper | Description
LLM-Tuning | -- | Tuning LLMs without tears.
ChatGPT-Comparison-Detection | LLM@IJCAI-23 | The first human-ChatGPT comparison corpus and detection tools.
GENIUS: Generating text using sketches! | arXiv 2023 | A novel pre-trained model for sketch-based text generation and data augmentation.
Selective Text Augmentation | arXiv 2022 | A simple but effective data augmentation method based on word roles.
Label Confusion Learning | AAAI 2021 | Label confusion learning for more robust model training.

genius's People

Contributors

beyondguo


genius's Issues

How can I further improve generation quality?

Using genius-base-chinese:

from transformers import pipeline

# Load the Chinese GENIUS model as a text2text-generation pipeline on GPU 0
genius = pipeline("text2text-generation", model=r'genius-base-chinese', device=0)

# A sketch: keywords separated by [MASK] spans for the model to fill in
sketch = "学生[MASK]作文[MASK]感受"
generated_text = genius(sketch, num_beams=1, do_sample=True, max_length=200)[0]['generated_text']

# Strip the spaces the tokenizer inserts between Chinese characters
generated_text = generated_text.replace(' ', '')
print(generated_text)

Generated results (sampled several times):
学生在作文中,有感受要先写出自己的理解、感受、观点,再看自己所作用的作品。
学生在阅读和完成作文之间产生了很多的联系和感受
学生在作文中,不仅是感受美国,我觉得也有大量东西可写,感受
学生对自己的作文有丰富的感受,这样就能够给予孩子更深刻的认识和感受
学生通过作文表现出自己对于生活及教育的认识及感受

If I want to push the generation quality further, which aspects should I work on?
Corpus size?
Model size?
Could you elaborate?
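Before touching corpus or model size, the decoding settings are usually the cheapest lever: since `do_sample=True` is already set, the pipeline call above also accepts standard `generate()` kwargs such as `temperature` and `top_p` (these are generic transformers parameters, not GENIUS-specific). A toy, model-free sketch of what temperature does to a next-token distribution:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature < 1 sharpens the distribution (safer, more repetitive text);
    # temperature > 1 flattens it (more diverse but noisier text).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]  # toy next-token scores
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)

# The top token's probability grows as temperature drops.
assert sharp[0] > softmax(logits)[0] > flat[0]
```

Lowering the temperature (or `top_p`) trades diversity for fluency, which is often worth trying before retraining anything.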

A minor issue with template==4 in genius_utils.py for Chinese data

For Chinese, there is an issue in genius_utils.py when template==4:
Input: 银色的罗马高跟鞋,圆球吊饰耳饰单带,个性十足,有非常抢眼!
Output: 银色罗马高跟鞋圆球吊饰耳饰单带个性十足[MASK]抢眼[MASK]
That is, when exactly one character sits between two keywords, it gets dropped.
I'm not sure whether this is intentional or a bug. If the code is changed to:

if sep == '' and id - all_ids[i - 1] == 2: # a space in between
    masked_text.append(s[id-1])

the output becomes: 银色的罗马高跟鞋,圆球吊饰耳饰单带,个性十足[MASK]抢眼[MASK]
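To make the gap-of-one behavior easy to test in isolation, here is a hypothetical standalone reconstruction of the sketch-building loop. The names (`all_ids`, `masked_text`, `sep`) mirror the genius_utils.py snippet above, but this is not the project's actual code — only a sketch of the proposed fix:

```python
def mask_sketch(s, all_ids, mask="[MASK]", sep=""):
    """Rebuild a sketch from the characters of s at positions all_ids.

    Hypothetical reconstruction: when exactly one character separates two
    kept positions (and characters are joined with no separator, as in
    Chinese), keep that character instead of dropping it; larger gaps
    collapse to a single mask token.
    """
    masked_text = []
    for i, idx in enumerate(all_ids):
        if i > 0:
            gap = idx - all_ids[i - 1]
            if sep == "" and gap == 2:  # a single character in between
                masked_text.append(s[idx - 1])
            elif gap > 1:
                masked_text.append(mask)
        masked_text.append(s[idx])
    return sep.join(masked_text)

# One-character gap ("的" at index 2) is preserved:
print(mask_sketch("银色的罗马高跟鞋", [0, 1, 3, 4, 5, 6, 7]))  # 银色的罗马高跟鞋
# A longer gap collapses to [MASK]:
print(mask_sketch("银色的罗马高跟鞋", [0, 1, 5, 6]))  # 银色[MASK]高跟
```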

The mask character in fnlp/bart-large-chinese?

Hello! While preparing the Chinese pre-training data, I noticed the mask used is ''. For fnlp/bart-base-chinese and fnlp/bart-large-chinese, shouldn't this mask be '[MASK]'?

Code output from scratch

In the HuggingFace Space (https://huggingface.co/spaces/beyond/genius), under the English tab I entered:
print("hello world")

The output was:
print"print""hello world""print""print" href"http://www.printmedia.com/printmedia/print/print" print"print:"hello world"print"printed"print/"print"Print"print"'print:"print""printed""print''print"prints"print "print" Print"print "print""Print" print""print","print" "print''Print""print;"print"/"print"" print"Print""Print"" Print""print ""print"" Print"Print "print", "print, ""print"

Is this because the training data contains no code?

Genius-f logs for NER

Hi there! Great repo. Would you mind sharing the fine-tuning logs for GENIUS-f on the NER use case? Or could you give an estimate of the final ROUGE metrics and how long training took to converge?

Thank You!

Using LAC rank to extract unimportant sentence components

Great idea! For having the pre-trained model drop the unimportant parts of a sentence, you might consider Baidu LAC's rank method, which scores the importance of each word in a sentence (https://github.com/baidu/lac). I tried the yake package you mentioned, but it doesn't seem very friendly to Chinese 😂 — though I may be using it incorrectly.
I have also considered masking some important words and using a BERT- or T5-style model to generate contrastive augmented samples for unsupervised sentence-representation training, but the results so far are not great. Your kind of method might work well as an additional source of augmented sentences, so I'll give it a try.

Composition of the Chinese corpus?

Hello, great project! I'd like to know how your Chinese data is composed. I want to add some essay and novel corpora on top of the corpus you provide, and I'm wondering whether they would overlap with your existing data. Thanks!

About Chinese data augmentation

Hello, I have a question about augmenting Chinese text data. The augmentation_clf directory contains a GENIUS augmentation script; besides switching the model to beyond/genius-base-chinese, does anything else need to be changed?

用自己的文本作训练出现的问题

自己使用文本作完处理后,进行预训练,在tokenized_dataset = dataset_with_sketch['train'].select(random.sample(range(5000),k=N)).map(preprocess_function, batched=True,
remove_columns=dataset_with_sketch['train'].column_names,
batch_size=10000,num_proc=25),这一块出现问题,显示没有train这一列,是我的数据集问题还是代码问题
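A common cause of this error is that the loaded object is a flat Dataset (columns only) rather than a DatasetDict keyed by split names, so `['train']` is not a valid key — with the real library, `load_dataset("text", data_files=...)` returns a DatasetDict with a 'train' split, while `Dataset.from_dict(...)` returns a flat Dataset. A stdlib-only sketch of the shape difference and a defensive accessor (the names here are hypothetical, not from the project's scripts):

```python
# Stand-ins for the two shapes a `datasets` object can take: a flat mapping
# of column names to values (like a Dataset), and a mapping of split names
# to such column mappings (like a DatasetDict).
flat = {"text": ["a", "b"], "sketch": ["[MASK] a", "b [MASK]"]}
splits = {"train": flat}

def get_train(ds):
    # If 'train' is present and maps to a column dict, take that split;
    # otherwise assume the whole object is already the training data.
    if "train" in ds and isinstance(ds["train"], dict):
        return ds["train"]
    return ds

print(get_train(splits) is flat)  # True: DatasetDict-like, 'train' split taken
print(get_train(flat) is flat)    # True: flat Dataset used as-is
```

So checking `print(dataset_with_sketch)` right after loading, to see whether it shows split names or column names at the top level, should reveal which case applies.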

Pre-training

Can you provide documentation to perform the pre-training on custom datasets?
How is the data formatted and what are the appropriate parameters for the scripts in the ./pre_training directory?

Thank you!

Data generation

Hello, my task is to generate text directly from sketches. I trained the model on my own text data, but all the examples I can find are clf and ner data augmentation. Where can I find the corresponding code for this task?
