OpenBG500

Information

OpenBG500 is an open Chinese e-commerce and business knowledge graph dataset containing 500 relations. It is refined from OpenBG, a million-scale multi-modal dataset involving products and consumption demands in a unified schema. AliOpenKG500 is developed for evaluating knowledge graph embedding methods.

The dataset is split into three parts: training, validation, and test. Basic statistics are shown in the table below.

#Relation   #Entity   #Train (opened)   #Valid (opened)   #Test
500         249,743   1,242,550         5,000             5,000

Data

OpenBG500 is available on Google Drive and Baidu Netdisk (password: 78fw). The main directory of the dataset is laid out as follows.

OpenBG500
├── OpenBG500_train.tsv            # Training set
├── OpenBG500_dev.tsv              # Validation set
├── OpenBG500_test.tsv             # Test set
├── OpenBG500_entity2text.tsv      # Descriptions of entities in Chinese
├── OpenBG500_relation2text.tsv    # Descriptions of relations in Chinese
└── OpenBG500_example_pred.tsv     # Example submission

Usage

Format

  • Triples
# OpenBG500_train.tsv/OpenBG500_dev.tsv
Head<\t>Relation<\t>Tail<\n>
  • Description of entities/relations in Chinese
# OpenBG500_entity2text.tsv/OpenBG500_relation2text.tsv
Entity (Relation)<\t>Description of entity (relation)<\n>
  • Test and submission
# OpenBG500_test.tsv: for each instance, participants must predict 10 tails. OpenBG500_example_pred.tsv is an example submission.
Head<\t>Relation<\n>

# OpenBG500_example_pred.tsv
Head<\t>Relation<\t>Tail 1<\t>Tail 2<\t>...<\t>Tail 10<\n>
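
As a rough illustration of the prediction format above, the sketch below writes one tab-separated line per query with exactly ten tails. The helper name `write_predictions`, the output file name `my_pred.tsv`, and the placeholder tails are all hypothetical; only the line layout comes from the format description.

```python
def write_predictions(path, predictions, num_tails=10):
    """Write one line per query: head, relation, then exactly num_tails tails."""
    with open(path, 'w', encoding='utf-8') as fp:
        for (head, relation), tails in predictions.items():
            if len(tails) != num_tails:
                raise ValueError(
                    f'{head} {relation}: expected {num_tails} tails, got {len(tails)}'
                )
            fp.write('\t'.join([head, relation, *tails]) + '\n')


# Placeholder predictions (hypothetical): a real model would rank candidate tails here.
demo_predictions = {
    ('ent_135492', 'rel_0352'): [f'ent_{i:06d}' for i in range(10)],
}
write_predictions('my_pred.tsv', demo_predictions)
```

Validating the tail count before writing means a malformed line fails locally rather than after upload.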

Check the data

$ head -n 3 OpenBG500_train.tsv
ent_135492      rel_0352        ent_015651
ent_020765      rel_0448        ent_214183
ent_106905      rel_0418        ent_121073
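
Beyond eyeballing the first lines, it can help to count the distinct entities and relations in a split. The sketch below uses a hypothetical helper, `summarize`, and runs it on the three sample triples shown above rather than on the real file. Counts taken from a single split may be lower than the full-dataset statistics if some entities appear only in other splits.

```python
def summarize(triples):
    """Count triples, distinct entities, and distinct relations."""
    entities = set()
    relations = set()
    for head, relation, tail in triples:
        entities.update((head, tail))
        relations.add(relation)
    return {
        'triples': len(triples),
        'entities': len(entities),
        'relations': len(relations),
    }


# The three sample lines printed by `head` above.
sample = [
    ['ent_135492', 'rel_0352', 'ent_015651'],
    ['ent_020765', 'rel_0448', 'ent_214183'],
    ['ent_106905', 'rel_0418', 'ent_121073'],
]
stats = summarize(sample)
print(stats)
# {'triples': 3, 'entities': 6, 'relations': 3}
```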

Read the datasets

  1. Read the original data:
with open('OpenBG500_train.tsv', 'r', encoding='utf-8') as fp:
    # Each line is one triple: head, relation, tail, separated by tabs.
    train = [line.rstrip('\n').split('\t') for line in fp]

for line in train[:2]:
    print(line)
# ['ent_135492', 'rel_0352', 'ent_015651']
# ['ent_020765', 'rel_0448', 'ent_214183']
  2. Get the Entity (Relation)-to-Description maps, ent2text and rel2text:
with open('OpenBG500_entity2text.tsv', 'r', encoding='utf-8') as fp:
    lines = [line.rstrip('\n').split('\t') for line in fp]

for line in lines[:2]:
    print(line)
# ['ent_101705', '短袖T恤']
# ['ent_116070', '套装']

# Map each entity ID to its Chinese description.
ent2text = {line[0]: line[1] for line in lines}

with open('OpenBG500_relation2text.tsv', 'r', encoding='utf-8') as fp:
    lines = [line.rstrip('\n').split('\t') for line in fp]

for line in lines[:2]:
    print(line)
# ['rel_0418', '细分市场']
# ['rel_0290', '关联场景']

# Map each relation ID to its Chinese description.
rel2text = {line[0]: line[1] for line in lines}
  3. Convert the triples to their text descriptions:
train = [[ent2text[line[0]], rel2text[line[1]], ent2text[line[2]]] for line in train]
for line in train[:2]:
    print(line)
# ['苦荞茶', '外部材质', '苦荞麦']
# ['精品三姐妹硬糕', '口味', '原味硬糕850克【10包40块糕】']
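
The direct dictionary lookups above raise a `KeyError` if an ID has no description. Whether every ID in the dataset has one is not stated here, so the sketch below shows one possible fallback that keeps the raw ID. The helper name `describe`, the `demo_` dictionaries, and the ID `ent_999999` are hypothetical and exist only for illustration.

```python
def describe(triple, ent2text, rel2text):
    """Map a (head, relation, tail) triple to text, keeping the raw ID when no description exists."""
    head, relation, tail = triple
    return [
        ent2text.get(head, head),
        rel2text.get(relation, relation),
        ent2text.get(tail, tail),
    ]


# Tiny stand-in maps; the real ones are built from the *_2text.tsv files.
demo_ent2text = {'ent_101705': '短袖T恤'}
demo_rel2text = {'rel_0418': '细分市场'}

# Hypothetical triple: 'ent_999999' deliberately has no description.
described = describe(['ent_101705', 'rel_0418', 'ent_999999'], demo_ent2text, demo_rel2text)
print(described)
# ['短袖T恤', '细分市场', 'ent_999999']
```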

Submit on Alibaba TIANCHI

OpenBG Benchmark: Large Scale Open Business Knowledge Graph Benchmark is a long-running open benchmark. You are welcome to submit your OpenBG500 results.

Baseline results

We ran several baseline methods on this dataset. The TransE, DistMult, and ComplEx results are based on the OpenKE toolkit; the KG-BERT and GenKGC results are based on our own code.

Method     Hits@1   Hits@3   Hits@10
TransE     0.207    0.340    0.531
DistMult   0.049    0.088    0.216
ComplEx    0.053    0.120    0.266
KG-BERT    0.023    0.049    0.241
GenKGC     0.203    0.280    0.351
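
For readers unfamiliar with the metric, Hits@k is commonly defined as the fraction of test queries whose correct tail appears among the top k ranked predictions. The sketch below follows that common definition; the function name `hits_at_k` and the toy data are hypothetical, and the exact evaluation protocol used by the benchmark (for example, any filtering of known triples) is not described here, so treat this as an approximation and not as the official scorer.

```python
def hits_at_k(ranked_tails, gold_tails, ks=(1, 3, 10)):
    """Fraction of queries whose gold tail is within the top-k ranked predictions."""
    if len(ranked_tails) != len(gold_tails):
        raise ValueError('ranked_tails and gold_tails must have the same length')
    if not gold_tails:
        raise ValueError('at least one query is required')
    totals = {k: 0 for k in ks}
    for ranked, gold in zip(ranked_tails, gold_tails):
        for k in ks:
            if gold in ranked[:k]:
                totals[k] += 1
    return {k: totals[k] / len(gold_tails) for k in ks}


# Toy data: three queries, each with a ranked list of candidate tails.
toy_ranked = [['a', 'b', 'c'], ['x', 'y', 'z'], ['p', 'q', 'r']]
toy_gold = ['a', 'z', 'missing']
scores = hits_at_k(toy_ranked, toy_gold)
print(scores)
```

In the toy data, only the first query is correct at rank 1, the second becomes correct at rank 3, and the third is never correct, so Hits@1 is one third and Hits@3 and Hits@10 are both two thirds.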

Contributors

cheasim, timelordri, zxlzr

openbg500's Issues

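Reproduced metrics are very poor

I tried to reproduce TransE and found the metrics were very low. I used the OpenKE library. Could you share how you set the hyperparameters?

Suggest updating the README

I came here from Tianchi and found that the two sites describe the dataset format differently. After downloading the data, I found that the README here is wrong: the files are actually in tsv format. The Google Drive link also appears to be dead (404).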
