
genius's Introduction

Hi there 👋

This is Biyang Guo!

Homepage: Beyond@SimpleAI

I'm a PhD student at SUFE. My research interests lie in NLP (Natural Language Processing) and DCAI (Data-Centric AI).

Here are some research projects that I lead:

Project | Paper | Description
LLM-Tuning | -- | Tuning LLMs without tears.
ChatGPT-Comparison-Detection | LLM@IJCAI-23 | The first human-ChatGPT comparison corpus and detection tools.
GENIUS: Generating text using sketches! | arXiv 2023 | A novel pre-trained model for sketch-based text generation and data augmentation.
Selective Text Augmentation | arXiv 2022 | A simple but effective data augmentation method based on word roles.
Label Confusion Learning | AAAI 2021 | Label confusion learning for more robust model training.

genius's People

Contributors

beyondguo


genius's Issues

How can I further improve generation quality?

Using genius-base-chinese:

from transformers import pipeline

# Load the Chinese GENIUS model as a text2text-generation pipeline on GPU 0
genius = pipeline("text2text-generation", model=r'genius-base-chinese', device=0)

# A sketch: keywords separated by [MASK] spans for the model to fill in
sketch = "学生[MASK]作文[MASK]感受"
generated_text = genius(sketch, num_beams=1, do_sample=True, max_length=200)[0]['generated_text']

# Strip the spaces the tokenizer inserts between Chinese characters
generated_text = generated_text.replace(' ', '')
print(generated_text)

Generated results (sampled several times):
学生在作文中,有感受要先写出自己的理解、感受、观点,再看自己所作用的作品。
学生在阅读和完成作文之间产生了很多的联系和感受
学生在作文中,不仅是感受美国,我觉得也有大量东西可写,感受
学生对自己的作文有丰富的感受,这样就能够给予孩子更深刻的认识和感受
学生通过作文表现出自己对于生活及教育的认识及感受

If I want to push the generation quality further, which aspects should I work on?
Corpus size?
Model size?
Could you elaborate?
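Before touching corpus or model size, the decoding settings are usually the cheapest lever: since `do_sample=True` is already set, the pipeline call above also accepts standard `generate()` kwargs such as `temperature` and `top_p` (these are generic transformers parameters, not GENIUS-specific). A toy, model-free sketch of what temperature does to a next-token distribution:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature < 1 sharpens the distribution (safer, more repetitive text);
    # temperature > 1 flattens it (more diverse but noisier text).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]  # toy next-token scores
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)

# The top token's probability grows as temperature drops.
assert sharp[0] > softmax(logits)[0] > flat[0]
```

Lowering the temperature (or `top_p`) trades diversity for fluency, which is often worth trying before retraining anything.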

A minor issue with template==4 in genius_utils.py for Chinese data

For Chinese, there is an issue in genius_utils.py when template==4:
Input: 银色的罗马高跟鞋,圆球吊饰耳饰单带,个性十足,有非常抢眼!
Output: 银色罗马高跟鞋圆球吊饰耳饰单带个性十足[MASK]抢眼[MASK]
That is, when exactly one character sits between two keywords, it gets dropped.
I'm not sure whether this is intentional or a bug. If the code is changed to:

if sep == '' and id - all_ids[i - 1] == 2: # a space in between
    masked_text.append(s[id-1])

the output becomes: 银色的罗马高跟鞋,圆球吊饰耳饰单带,个性十足[MASK]抢眼[MASK]
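To make the gap-of-one behavior easy to test in isolation, here is a hypothetical standalone reconstruction of the sketch-building loop. The names (`all_ids`, `masked_text`, `sep`) mirror the genius_utils.py snippet above, but this is not the project's actual code — only a sketch of the proposed fix:

```python
def mask_sketch(s, all_ids, mask="[MASK]", sep=""):
    """Rebuild a sketch from the characters of s at positions all_ids.

    Hypothetical reconstruction: when exactly one character separates two
    kept positions (and characters are joined with no separator, as in
    Chinese), keep that character instead of dropping it; larger gaps
    collapse to a single mask token.
    """
    masked_text = []
    for i, idx in enumerate(all_ids):
        if i > 0:
            gap = idx - all_ids[i - 1]
            if sep == "" and gap == 2:  # a single character in between
                masked_text.append(s[idx - 1])
            elif gap > 1:
                masked_text.append(mask)
        masked_text.append(s[idx])
    return sep.join(masked_text)

# One-character gap ("的" at index 2) is preserved:
print(mask_sketch("银色的罗马高跟鞋", [0, 1, 3, 4, 5, 6, 7]))  # 银色的罗马高跟鞋
# A longer gap collapses to [MASK]:
print(mask_sketch("银色的罗马高跟鞋", [0, 1, 5, 6]))  # 银色[MASK]高跟
```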

The mask character in fnlp/bart-large-chinese?

Hello! While preparing the Chinese pre-training data, I noticed the mask used is ''. For fnlp/bart-base-chinese and fnlp/bart-large-chinese, shouldn't this mask be '[MASK]'?

Code output from scratch

In the HuggingFace Space (https://huggingface.co/spaces/beyond/genius), under the English tab I entered:
print("hello world")

The output was:
print"print""hello world""print""print" href"http://www.printmedia.com/printmedia/print/print" print"print:"hello world"print"printed"print/"print"Print"print"'print:"print""printed""print''print"prints"print "print" Print"print "print""Print" print""print","print" "print''Print""print;"print"/"print"" print"Print""Print"" Print""print ""print"" Print"Print "print", "print, ""print"

Is this because the training data contains no code?

Genius-f logs for NER

Hi there! Great repo. Would you mind sharing the fine-tuning logs for GENIUS-f on the NER use case? Or could you give an estimate of the final ROUGE metrics and how long training took to converge?

Thank You!

Using LAC rank to extract unimportant sentence components

Great idea! For having the pre-trained model drop the unimportant parts of a sentence, you might consider Baidu LAC's rank method, which scores the importance of each word in a sentence (https://github.com/baidu/lac). I tried the yake package you mentioned, but it doesn't seem very friendly to Chinese 😂 — though I may be using it incorrectly.
I have also considered masking some important words and using a BERT- or T5-style model to generate contrastive augmented samples for unsupervised sentence-representation training, but the results so far are not great. Your kind of method might work well as an additional source of augmented sentences, so I'll give it a try.

Composition of the Chinese corpus?

Hello, great project! I'd like to know how your Chinese data is composed. I want to add some essay and novel corpora on top of the corpus you provide, and I'm wondering whether they would overlap with your existing data. Thanks!

About Chinese data augmentation

Hello, I have a question about augmenting Chinese text data. The augmentation_clf directory contains a GENIUS augmentation script; besides switching the model to beyond/genius-base-chinese, does anything else need to be changed?

用自己的文本作训练出现的问题

自己使用文本作完处理后,进行预训练,在tokenized_dataset = dataset_with_sketch['train'].select(random.sample(range(5000),k=N)).map(preprocess_function, batched=True,
remove_columns=dataset_with_sketch['train'].column_names,
batch_size=10000,num_proc=25),这一块出现问题,显示没有train这一列,是我的数据集问题还是代码问题
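A common cause of this error is that the loaded object is a flat Dataset (columns only) rather than a DatasetDict keyed by split names, so `['train']` is not a valid key — with the real library, `load_dataset("text", data_files=...)` returns a DatasetDict with a 'train' split, while `Dataset.from_dict(...)` returns a flat Dataset. A stdlib-only sketch of the shape difference and a defensive accessor (the names here are hypothetical, not from the project's scripts):

```python
# Stand-ins for the two shapes a `datasets` object can take: a flat mapping
# of column names to values (like a Dataset), and a mapping of split names
# to such column mappings (like a DatasetDict).
flat = {"text": ["a", "b"], "sketch": ["[MASK] a", "b [MASK]"]}
splits = {"train": flat}

def get_train(ds):
    # If 'train' is present and maps to a column dict, take that split;
    # otherwise assume the whole object is already the training data.
    if "train" in ds and isinstance(ds["train"], dict):
        return ds["train"]
    return ds

print(get_train(splits) is flat)  # True: DatasetDict-like, 'train' split taken
print(get_train(flat) is flat)    # True: flat Dataset used as-is
```

So checking `print(dataset_with_sketch)` right after loading, to see whether it shows split names or column names at the top level, should reveal which case applies.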

Pre-training

Can you provide documentation to perform the pre-training on custom datasets?
How is the data formatted and what are the appropriate parameters for the scripts in the ./pre_training directory?

Thank you!

Data generation

Hello, my task is to generate text directly from sketches. I trained the model on my own text data, but all the examples I can find are clf and ner data augmentation. Where can I find the corresponding code for this task?
