wisdomify / storyteller

An Elasticsearch-powered forward-dictionary of Korean proverbs

Python 70.79% Jupyter Notebook 29.21%

elasticsearch forward-dictionary education idioms proverbs language

storyteller's Introduction

storyteller

.env

This project keeps secrets such as user IDs, passwords, and API tokens in a .env file in the project's root directory.

The .env file must follow this format:

es_user={elasticsearch user name}
es_password={elasticsearch user password}
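
As a rough illustration of how such a file could be read, the sketch below parses KEY=VALUE lines with the standard library only. The helper names parse_env and load_env are hypothetical and are not part of storyteller; the project may well load its secrets differently.

```python
from pathlib import Path
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    secrets: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        secrets[key.strip()] = value.strip()
    return secrets


def load_env(path: str = ".env") -> Dict[str, str]:
    """Read the .env file (the default path is an assumption)."""
    return parse_env(Path(path).read_text(encoding="utf-8"))
```

Calling load_env() from the project root would then return a dictionary with the es_user and es_password keys shown above.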

Downloading data

Indexing a corpus

python3 -m storyteller.main.index --i="gk" --b=1500

Searching a phrase

python3 -m storyteller.main.search --wisdom="산 넘어 산"

Building wisdom2def or wisdom2eg

storyteller's Issues

Implement preprocess.normalise

What?

Normalise Korean spelling
Normalise Korean spacing
Normalise Korean emoticons (ㅋㅋㅋ, ㅎㅎㅎ, etc.)
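
A minimal sketch of the emoticon part, assuming "emoticon normalisation" means capping long runs of a repeated Hangul jamo such as ㅋㅋㅋㅋㅋ. The function name normalise_text and the max_repeat parameter are hypothetical. It works on a single string rather than a DataFrame, and it only collapses whitespace; real spelling and spacing correction would need a dedicated tool.

```python
import re

# A single Hangul compatibility jamo (ㄱ..ㅣ) repeated three or more times in a row.
_REPEATED_JAMO = re.compile(r"([\u3131-\u3163])\1{2,}")
_WHITESPACE = re.compile(r"\s+")


def normalise_text(text: str, max_repeat: int = 2) -> str:
    """Cap runs of a repeated jamo at max_repeat and collapse whitespace."""
    text = _REPEATED_JAMO.sub(lambda m: m.group(1) * max_repeat, text)
    return _WHITESPACE.sub(" ", text).strip()
```

Applied to a pandas column, this could be used along the lines of df["sents"].map(normalise_text), assuming the text lives in a column named sents.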

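Upload the data collected from the Korea University corpus to ES as well

TL;DR

Let's make use of the data we already collected.

WHY?

The data we collected earlier is still meaningful.

WHAT?

So rather than throwing it away, upload it to ES as a separate index.

TODOs

Gather the previously collected data locally again
Add logic that loads a tsv and converts it to json
Use the tsv-to-json script to convert the existing tsv files into a form that can be stored in Elasticsearch
Index the converted json into ES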
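
The tsv-to-json step could look roughly like the sketch below. It assumes the tsv files have a header row and no quoted fields. The column names used in the usage note (wisdom, eg) are illustrative guesses, and tsv_to_records and records_to_json are hypothetical helper names.

```python
import csv
import io
import json
from typing import Dict, List


def tsv_to_records(tsv_text: str) -> List[Dict[str, str]]:
    """Turn tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


def records_to_json(records: List[Dict[str, str]]) -> str:
    """Serialise records as JSON, keeping Korean text readable."""
    return json.dumps(records, ensure_ascii=False)
```

For example, a file whose header is wisdom and eg (tab-separated) becomes a list of {"wisdom": ..., "eg": ...} objects, each of which can then be indexed as one document.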

Implement preprocess.upsample

What?

A function that upsamples the classes with relatively few samples to match the class with the most samples.
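
One way to read this is random oversampling with replacement, sketched below on plain (text, label) tuples instead of a DataFrame. The row layout, the seed default, and the choice of sampling with replacement are all assumptions; the repository's eventual upsample may differ.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

Row = Tuple[str, Hashable]  # (text, class label); this layout is an assumption


def upsample(rows: Sequence[Row], seed: int = 410) -> List[Row]:
    """Randomly repeat rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    if not by_label:
        return []
    target = max(len(group) for group in by_label.values())
    out: List[Row] = []
    for group in by_label.values():
        out.extend(group)  # keep every original row
        out.extend(rng.choices(group, k=target - len(group)))  # sample with replacement
    return out
```

Because the extra rows are exact copies, this balances class counts without adding new information.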

aihub data upload (odd-numbered)

TL;DR

Upload the odd-numbered datasets among those published on aihub.

TODOs

Upload ultimate test dataset to W&B

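Collect & index proverb data from news

What?

Use the BigKinds API to find as many usage examples as possible for every proverb in wisdoms.

Why?

News articles usually contain many usage examples of proverbs.

To-do's

Check that API requests work
Collect the news data (if phrase search is not accurate, just split the phrase into tokens and use boolean search)
Then implement News(Story)
Finish all the way through indexing: python3 -m storyteller.main.index --index=news_story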
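
The token-splitting fallback mentioned in the to-do list can be sketched as follows. This assumes the search backend accepts terms joined by an AND operator; the actual BigKinds query syntax is not confirmed here, and to_boolean_query is a hypothetical helper name.

```python
from typing import List


def to_boolean_query(wisdom: str, operator: str = "AND") -> str:
    """Split a proverb on whitespace and join the unique tokens with a boolean operator."""
    # dict.fromkeys removes duplicate tokens while preserving their order.
    tokens: List[str] = list(dict.fromkeys(wisdom.split()))
    return f" {operator} ".join(tokens)
```

For the proverb 산 넘어 산, the repeated token is dropped, so the query only requires 산 and 넘어 to both appear.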

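Preprocess and upload the even-numbered ai-hub data

TL;DR

WHY?

Why upload?

We need to gather a lot of data to put on ES. ai-hub has several datasets that are "likely to contain proverbs", so the plan is to upload them all first and then use only the data that actually contains proverbs.

Why preprocess?

Ideally every record would match our form, with exactly one sentence in sentence, but reality is not that kind. Documenting how each dataset is uploaded should make it easier to retrieve the data later if we need to.

WHAT?

From the even-numbered natural-language/speech datasets on ai-hub, pick the ones likely to contain proverbs and document how to extract just the sentences from each dataset.
Also, upload the preprocessed data to ES cloud using the bulk API.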
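
For the bulk upload, the Elasticsearch bulk API expects newline-delimited JSON in which every document is preceded by an action line. The sketch below builds such request bodies in batches without calling any client library. The helper names, the index name in the usage note, and the batch size are assumptions.

```python
import json
from typing import Dict, Iterable, Iterator, List


def chunked(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield successive batches of at most `size` documents."""
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_bulk_body(docs: Iterable[Dict], index_name: str) -> str:
    """Build an NDJSON bulk body: one 'index' action line followed by one source line per doc."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}, ensure_ascii=False))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Each batch from chunked(docs, 1500) could then be turned into one request body with to_bulk_body(batch, "sc_story") and sent to the cluster's _bulk endpoint.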

TODOs

Just hardcode the data URLs in the code

What?

Why?

As long as edit permission is locked down, it is fine to let anyone with the link view the data. So let's hardcode the links.

Bring essentials from legacy

Add artifacts.py (for loading & building data)

What?

Why?

To-do's

Apply wandb Tables

What?

Complete builders.py using wandb Tables.

Why?

With the Tables object provided by wandb, data can be pushed straight to wandb without saving tsv files locally. Access works the same way. Since no path needs to be specified, the code becomes much more concise.

To-do's

Review the Tables documentation
Add the augment and parse functions (only for Wisdom2EgUploader!), and implement the normalise function
elastic/utils.py -> elastic/crud.py
Add builders.py

Implement preprocess.parse

Implement preprocess.cleanse

Refactoring the repository

0. Download the data and define the path

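For example, like this:

1. defining the indices

Define the following Document class in storyteller/elastic/docs.py. The following need to be implemented:

Any fields to add besides sents.
stream_from_corpus(): parses the corpus data and streams objects of that Doc.
The Index meta class.

For example, the emotional dialogue (감성대화) corpus is defined as follows: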

import json
import os
from typing import Generator

# Story, Keyword and SC_DIR are defined or imported elsewhere in the project.


class SC(Story):
    """
    Index for the emotional dialogue (감성대화) corpus.
    """
    # --- additional fields for SC --- #
    profile_id = Keyword()
    talk_id = Keyword()

    @staticmethod
    def stream_from_corpus() -> Generator['SC', None, None]:
        train_json_path = os.path.join(SC_DIR, "Training", "감성대화말뭉치(최종데이터)_Training.json")
        val_json_path = os.path.join(SC_DIR, "Validation", "감성대화말뭉치(최종데이터)_Validation.json")

        for json_path in (train_json_path, val_json_path):
            # The corpus is Korean text; it is assumed here to be UTF-8 encoded.
            with open(json_path, 'r', encoding='utf-8') as fh:
                corpus_json = json.load(fh)
            for sample in corpus_json:
                # Join all utterances of one talk into a single sents string.
                yield SC(sents=" ".join(sample['talk']['content'].values()),
                         profile_id=sample['talk']['id']['profile-id'],
                         talk_id=sample['talk']['id']['talk-id'])

    class Index:
        # name of the index for this corpus
        name = "sc_story"
        settings = Story.settings()

2. index

Once step 1 is done, indexing can be run right away with the following commands.

python3 -m storyteller.main.index --index=gk_story  # index the general knowledge (일반상식) corpus
python3 -m storyteller.main.index --index=sc_story  # index the emotional dialogue (감성대화) corpus
python3 -m storyteller.main.index --index=mr_story  # index the machine dialogue (기계대화) corpus

3. search

The exact search logic is not finished yet, but rough searching already works. For more accurate search, the Searcher class defined in storyteller/elastic/searcher.py needs to be modified.

python3 -m storyteller.main.search --wisdom="산 넘어 산"

4. build

The wandb artifacts managed by storyteller are:

wisdoms
wisdomify_test
wisdom2def
wisdom2eg

Before uploading an artifact to wandb, first build each artifact's files and directories locally as follows:

data
├── corpora
└── wandb
    ├── artifacts
    │   ├── wisdom2def
    │   │   ├── wisdom2def.tsv
    │   │   ├── wisdom2def_raw.tsv
    │   │   ├── wisdom2def_train.tsv
    │   │   └── wisdom2def_val.tsv
    │   ├── wisdom2eg
    │   │   ├── wisdom2eg.tsv
    │   │   ├── wisdom2eg_raw.tsv
    │   │   ├── wisdom2eg_train.tsv
    │   │   └── wisdom2eg_val.tsv
    │   ├── wisdomify_test.tsv
    │   └── wisdoms.txt

The scripts for this are:

python3 -m storyteller.main.build --artifact_name="wisdoms"
python3 -m storyteller.main.build --artifact_name="wisdomify_test"
python3 -m storyteller.main.build --artifact_name="wisdom2def"
python3 -m storyteller.main.build --artifact_name="wisdom2eg"

5. upload

Once the build is done, the artifacts can be uploaded to wandb with the following scripts.

python3 -m storyteller.main.upload --artifact_name="wisdoms"
python3 -m storyteller.main.upload --artifact_name="wisdomify_test"
python3 -m storyteller.main.upload --artifact_name="wisdom2def"
python3 -m storyteller.main.upload --artifact_name="wisdom2eg"

Download??

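Downloading the corpora is on hold for now. The GCP downloader has also been removed for the time being. That does not mean it is unnecessary; it was removed temporarily only because it cannot be used right now. Download logic will probably need to be added separately later.

Refactoring - metaflow integration & versioning nomenclature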

Why?

Data processing = DAG!

To-do's

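Define the FlowSpec skeleton

Add a support util for loading data from GCP cloud

TL;DR

Add a support util for loading data from GCP cloud

WHY?

Rather than having someone download the raw data by hand over and over, it seems better to download the raw data already uploaded to GCP and preprocess that.

WHAT?

Write functions and a class that load the data from GCP cloud, unzip the zip file, and then print the root directory name, the directory structure, and the file list.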
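
The reporting half of this (root directory name, directory structure, file list) can be sketched with the standard library alone. The function name describe_tree is hypothetical, and the download from GCP itself is deliberately left out.

```python
import os
from typing import List, Tuple


def describe_tree(root: str) -> Tuple[str, List[str], List[str]]:
    """Return the root directory name, all sub-directories, and all files (paths relative to root)."""
    root_name = os.path.basename(os.path.normpath(root))
    dirs: List[str] = []
    files: List[str] = []
    for current, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes os.walk visit sub-directories in a stable order
        for name in dirnames:
            dirs.append(os.path.relpath(os.path.join(current, name), root))
        for name in sorted(filenames):
            files.append(os.path.relpath(os.path.join(current, name), root))
    return root_name, dirs, files
```

Printing the three returned values after extraction gives a quick check that the archive unpacked into the expected layout.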

TODOs

Use part of wisdomify_test as the validation set

What?

wisdom2def, wisdom2eg -> split into raw and all only.
wisdomify_test -> split into raw, all, test, and val.
This makes it possible to tune the number of epochs with early stopping.
Set the validation ratio to 20%: out of 100 examples, 20 are used for validation.
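
A stratified 20% split could be sketched as below, assuming each row is a (wisdom, query) pair and that stratification is done per wisdom. The row layout, the seed default, and the rounding rule are assumptions rather than the repository's actual stratified_split.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple

Row = Tuple[str, str]  # (wisdom, query); this layout is an assumption


def stratified_split(rows: Sequence[Row], val_ratio: float = 0.2,
                     seed: int = 410) -> Tuple[List[Row], List[Row]]:
    """Split rows into (val, test) so that each wisdom keeps roughly val_ratio in val."""
    rng = random.Random(seed)
    by_wisdom = defaultdict(list)
    for row in rows:
        by_wisdom[row[0]].append(row)
    val: List[Row] = []
    test: List[Row] = []
    for group in by_wisdom.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_ratio)
        val.extend(shuffled[:n_val])
        test.extend(shuffled[n_val:])
    return val, test
```

Storing val_ratio and seed alongside the artifact keeps the split reproducible.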

Why?

The problem we ultimately want to solve is wisdom2test, but if the validation set is taken from wisdom2def, there is no way to tell when the model starts overfitting on that final problem.

To-do's - the tasks

Rename wisdom2test -> wisdomify_test again
This might be more fitting -> wisdomify_test -> wisdom2query
Implement stratified_split
Split wisdom2def and wisdom2eg into raw and all only
Split wisdomify_test into raw, all, val, and test only

To-do's - some simple refactoring

Add important metadata to each artifact (e.g. for wisdomify_test -> val_ratio, seed)
~~[ ] flatten out the data directory (data/corpora, data/wandb -> corpora, wandb) - flat is better than nested~~ The folders get sorted alphabetically... which actually makes things messier. Just leave it as is.

Implement preprocess

What?

Implement the preprocessing logic. Only the parts marked TODO need to be implemented!

storyteller/storyteller/preprocess.py

Lines 6 to 35 in 501a4de

    
def augment(df: pd.DataFrame) -> pd.DataFrame:
    # TODO implement augmentation.
    return df


def parse(df: pd.DataFrame) -> pd.DataFrame:
    """
    parse <em> ...</em> to [WISDOM].
    :param df:
    :return:
    """
    # TODO: implement parsing
    return df


def normalise(df: pd.DataFrame) -> pd.DataFrame:
    """
    1. normalise the emoticons.
    2. normalise the spacings.
    3. normalise grammatical errors.
    :param df:
    :return:
    """
    # TODO: implement normalisation
    return df


def upsample(df: pd.DataFrame) -> pd.DataFrame:
    # TODO: implement upsampling
    return df
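
The parse docstring above describes replacing `<em> ...</em>` spans with [WISDOM]. A string-level sketch of that step is shown below. It assumes the tags are never nested. parse_text is a hypothetical helper, not the repository's implementation.

```python
import re

# Non-greedy match so that each <em>...</em> span is replaced separately.
_EM_SPAN = re.compile(r"<em>.*?</em>", flags=re.DOTALL)


def parse_text(text: str, placeholder: str = "[WISDOM]") -> str:
    """Replace every <em>...</em> span with the placeholder token."""
    return _EM_SPAN.sub(placeholder, text)
```

On a DataFrame this could be mapped over the example column, so that the model sees [WISDOM] in place of the highlighted proverb.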

Why?

...

To-do's

Implement preprocess.augment

What?

How?

Making good use of the following library's code should make this doable. It does not depend on konlpy either, so there is no need to install a JVM.
https://github.com/catSirup/KorEDA
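
To give a feel for what such augmentation might involve, the sketch below implements two generic EDA-style operations (random swap and random deletion) on whitespace tokens. It is not KorEDA's API; all function names, the seed, and the deletion probability are assumptions for illustration.

```python
import random
from typing import List


def random_swap(tokens: List[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions, n_swaps times."""
    tokens = tokens[:]
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each token with probability p, always keeping at least one token."""
    if len(tokens) <= 1:
        return tokens[:]
    kept = [token for token in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]


def augment_sentence(sentence: str, seed: int = 410, p_delete: float = 0.1) -> List[str]:
    """Return two augmented variants of a sentence: one swapped, one with deletions."""
    rng = random.Random(seed)
    tokens = sentence.split()
    return [" ".join(random_swap(tokens, 1, rng)),
            " ".join(random_deletion(tokens, p_delete, rng))]
```

Whitespace tokenisation needs no morphological analyser, which fits the goal of avoiding a konlpy/JVM dependency.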

GCP storage download unzip feature

TL;DR

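Make it possible to download data from GCP storage and unzip it in one go.

WHY?

The current class only downloads the zip file.
That way we would end up writing the unzip step over and over.

WHAT?

Add an unzipping method to the GCPStorage class.

TODOs

Add the unzipping method
Create the directory where the zip is saved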
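
The two to-do items could be covered by a small helper like the one below, written as a free function because the GCPStorage class itself is not shown here. The name unzip and its parameters are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import List


def unzip(zip_path: str, dest_dir: str) -> List[str]:
    """Create dest_dir if needed, extract the archive into it, and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # also creates missing parent directories
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Wrapping this in a method on the storage class would let a single call download and extract a corpus.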

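Build a Flask API for storyteller search

What?

Build a Flask API for a search engine that, on a reverse search, shows the proverb's definition along with other eg/source entries.

Why?

When a proverb is searched, we also need to see how it is used in these examples!

TODO

Decide how many search results to show.
- If all of them are shown, decide how many to show per page.
Write a Flask API that returns usage examples when a proverb is given.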
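
Whatever page size is chosen, the paging itself can be as simple as the sketch below. It is framework-agnostic and does not use Flask; paginate, its 1-based page numbering, and the default of 10 results per page are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def paginate(results: Sequence[T], page: int, per_page: int = 10) -> List[T]:
    """Return the slice of results for a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    return list(results[start:start + per_page])
```

A search endpoint could read page and per_page from the request's query parameters and pass them straight through.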

wandb all dataset structure change

TL;DR

current: train, validate, test

changed: train + validate → train.tsv / test → validate.tsv
test data is now called from ultimate test queries.

WHY?

ultimate test dataset is added to wandb. Now train + validate should be train set and test should be validate.

WHAT?

changed: train + validate → train.tsv / test → validate.tsv

TODOs
