
MatchZoo's Introduction


Facilitating the design, comparison and sharing of deep text matching models.
(MatchZoo is a general-purpose text matching toolkit, designed to make it easy to implement, compare, and share the latest deep text matching models.)


🔥 News: MatchZoo-py (the PyTorch version of MatchZoo) is now available.

The goal of MatchZoo is to provide a high-quality codebase for deep text matching research, such as document retrieval, question answering, conversational response ranking, and paraphrase identification. With a unified data processing pipeline, simplified model configuration, and automatic hyper-parameter tuning, MatchZoo is flexible and easy to use.

Task                        Text 1     Text 2      Objective
Paraphrase Identification   string 1   string 2    classification
Textual Entailment          text       hypothesis  classification
Question Answering          question   answer      classification/ranking
Conversation                dialog     response    classification/ranking
Information Retrieval       query      document    ranking
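
Each objective corresponds to a task object in code. For example, alongside the Ranking task used below, a classification task can be declared like this (a short sketch using the MatchZoo 2.x task API):

import matchzoo as mz

# num_classes tells the task (and hence the model's output layer) how many labels to predict.
classification_task = mz.tasks.Classification(num_classes=2)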

Get Started in 60 Seconds

To train a Deep Structured Semantic Model (DSSM), import matchzoo and prepare the input data.

import matchzoo as mz

train_pack = mz.datasets.wiki_qa.load_data('train', task='ranking')
valid_pack = mz.datasets.wiki_qa.load_data('dev', task='ranking')

Preprocess your input data in three lines of code, keeping track of parameters to be passed into the model.

preprocessor = mz.preprocessors.DSSMPreprocessor()
train_processed = preprocessor.fit_transform(train_pack)
valid_processed = preprocessor.transform(valid_pack)
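
The preprocessor's fitted context stores data-dependent parameters gathered during fit_transform, such as the input shapes that the model will need below; you can inspect it directly:

print(preprocessor.context['input_shapes'])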

Make use of MatchZoo customized loss functions and evaluation metrics:

ranking_task = mz.tasks.Ranking(loss=mz.losses.RankCrossEntropyLoss(num_neg=4))
ranking_task.metrics = [
    mz.metrics.NormalizedDiscountedCumulativeGain(k=3),
    mz.metrics.MeanAveragePrecision()
]
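
As a side note, MatchZoo metrics are plain callables of the form metric(y_true, y_pred) on NumPy arrays, so you can sanity-check them on toy data (the numbers below are made up for illustration):

import numpy as np

metric = mz.metrics.MeanAveragePrecision()
# One query with three candidates: relevance labels vs. model scores.
print(metric(np.array([0, 1, 0]), np.array([0.1, 0.6, 0.3])))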

Initialize the model and fine-tune the hyper-parameters.

model = mz.models.DSSM()
model.params['input_shapes'] = preprocessor.context['input_shapes']
model.params['task'] = ranking_task
model.guess_and_fill_missing_params()
model.build()
model.compile()

Generate pair-wise training data on the fly, and evaluate model performance on validation data using customized callbacks.

train_generator = mz.PairDataGenerator(train_processed, num_dup=1, num_neg=4, batch_size=64, shuffle=True)
valid_x, valid_y = valid_processed.unpack()
evaluate = mz.callbacks.EvaluateAllMetrics(model, x=valid_x, y=valid_y, batch_size=len(valid_x))
history = model.fit_generator(train_generator, epochs=20, callbacks=[evaluate], workers=5, use_multiprocessing=False)
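
A trained model can then be persisted and restored; a minimal sketch, assuming the MatchZoo 2.x save/load API (model.save and mz.load_model):

model.save('my-dssm-model')
loaded_model = mz.load_model('my-dssm-model')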

References

Tutorials

English Documentation

中文文档

If you're interested in cutting-edge research progress, please take a look at awaresome neural models for semantic match.

Install

MatchZoo depends on Keras and TensorFlow. There are two ways to install MatchZoo:

Install MatchZoo from PyPI:

pip install matchzoo

Install MatchZoo from the GitHub source:

git clone https://github.com/NTMC-Community/MatchZoo.git
cd MatchZoo
python setup.py install

Models

  1. DRMM: this model is an implementation of A Deep Relevance Matching Model for Ad-hoc Retrieval.

  2. MatchPyramid: this model is an implementation of Text Matching as Image Recognition.

  3. ARC-I: this model is an implementation of Convolutional Neural Network Architectures for Matching Natural Language Sentences.

  4. DSSM: this model is an implementation of Learning Deep Structured Semantic Models for Web Search using Clickthrough Data.

  5. CDSSM: this model is an implementation of Learning Semantic Representations Using Convolutional Neural Networks for Web Search.

  6. ARC-II: this model is an implementation of Convolutional Neural Network Architectures for Matching Natural Language Sentences.

  7. MV-LSTM: this model is an implementation of A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations.

  8. aNMM: this model is an implementation of aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model.

  9. DUET: this model is an implementation of Learning to Match Using Local and Distributed Representations of Text for Web Search.

  10. K-NRM: this model is an implementation of End-to-End Neural Ad-hoc Ranking with Kernel Pooling.

  11. CONV-KNRM: this model is an implementation of Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search.

  12. Models under development: Match-SRNN, DeepRank, BiMPM, and more (see the sketch below for listing what your installed version provides).
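
All of the above share the same parameter interface, so swapping architectures usually means changing only the model class. To enumerate the models your installed version actually ships, something like the following should work (mz.models.list_available() is assumed from the MatchZoo 2.x API):

import matchzoo as mz

# Print every model class bundled with the installed MatchZoo release.
for model_class in mz.models.list_available():
    print(model_class)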

Citation

If you use MatchZoo in your research, please use the following BibTeX entry.

@inproceedings{Guo:2019:MLP:3331184.3331403,
 author = {Guo, Jiafeng and Fan, Yixing and Ji, Xiang and Cheng, Xueqi},
 title = {MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching},
 booktitle = {Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR'19},
 year = {2019},
 isbn = {978-1-4503-6172-9},
 location = {Paris, France},
 pages = {1297--1300},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/3331184.3331403},
 doi = {10.1145/3331184.3331403},
 acmid = {3331403},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {matchzoo, neural network, text matching},
} 

Development Team

  • faneshion (Fan Yixing): Core Dev, ASST PROF, ICT
  • bwanglzu (Wang Bo): Core Dev, M.S. TU Delft
  • uduse (Wang Zeyi): Core Dev, B.S. UC Davis
  • pl8787 (Pang Liang): Core Dev, ASST PROF, ICT
  • yangliuy (Yang Liu): Core Dev, PhD. UMASS
  • wqh17101 (Wang Qinghua): Documentation, B.S. Shandong Univ.
  • ZizhenWang (Wang Zizhen): Dev, M.S. UCAS
  • lixinsu (Su Lixin): Dev, PhD. UCAS
  • zhouzhouyang520 (Yang Zhou): Dev, M.S. CQUT
  • rgtjf (Tian Junfeng): Dev, M.S. ECNU

Contribution

Please make sure to read the Contributing Guide before creating a pull request. If you have a MatchZoo-related paper/project/component/tool, send a pull request to this awesome list!

Thank you to all the people who already contributed to MatchZoo!

Jianpeng Hou, Lijuan Chen, Yukun Zheng, Niuguo Cheng, Dai Zhuyun, Aneesh Joshi, Zeno Gantner, Kai Huang, stanpcf, ChangQF, Mike Kellogg

Project Organizers

  • Jiafeng Guo
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage
  • Yanyan Lan
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage
  • Xueqi Cheng
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage

License

Apache-2.0

Copyright (c) 2015-present, Yixing Fan (faneshion)

MatchZoo's People

Contributors

adedzy, aneesh-joshi, bwanglzu, caiyinqiong, changqf, chriskuei, crystina-z, faneshion, githubclj, hkvision, houjp, jellying, jibrilfrej, lixinsu, matthew-z, niuox, pl8787, rgtjf, sleepybag, stanpcf, uduse, wqh17101, wsdm2019-dapa, yangliuy, zenogantner, zhouzhouyang520, zizhenwang


MatchZoo's Issues

Running bash run_mvlstm.sh in MatchZoo/examples/wikiqa failed

mldl@mldlUB1604:/ub16_prj/MatchZoo/examples/wikiqa$ bash run_mvlstm.sh
Using TensorFlow backend.
2017-12-14 03:37:07.053967: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:37:07.053990: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:37:07.054015: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:37:07.054020: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:37:07.054025: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:37:07.142388: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-12-14 03:37:07.142703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 950M
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.65GiB
2017-12-14 03:37:07.142718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-12-14 03:37:07.142723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-12-14 03:37:07.142734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0)
{
"inputs": {
"test": {
"phase": "EVAL",
"input_type": "ListGenerator",
"relation_file": "./data/WikiQA/relation_test.txt",
"batch_list": 10
},
"predict": {
"phase": "PREDICT",
"input_type": "ListGenerator",
"relation_file": "./data/WikiQA/relation_test.txt",
"batch_list": 10
},
"train": {
"relation_file": "./data/WikiQA/relation_train.txt",
"input_type": "PairGenerator",
"batch_size": 100,
"batch_per_iter": 5,
"phase": "TRAIN",
"query_per_iter": 50,
"use_iter": false
},
"share": {
"vocab_size": 18670,
"use_dpool": false,
"embed_size": 50,
"target_mode": "ranking",
"text1_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"text2_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"embed_path": "./data/WikiQA/embed_glove_d50",
"text1_maxlen": 10,
"train_embed": false,
"text2_maxlen": 40
},
"valid": {
"phase": "EVAL",
"input_type": "ListGenerator",
"relation_file": "./data/WikiQA/relation_valid.txt",
"batch_list": 10
}
},
"global": {
"optimizer": "adadelta",
"num_iters": 400,
"save_weights_iters": 10,
"learning_rate": 0.0001,
"test_weights_iters": 400,
"weights_file": "examples/wikiqa/weights/mvlstm.wikiqa.weights",
"model_type": "PY",
"display_interval": 10
},
"outputs": {
"predict": {
"save_format": "TREC",
"save_path": "predict.test.wikiqa.txt"
}
},
"losses": [
{
"object_name": "rank_hinge_loss",
"object_params": {
"margin": 1.0
}
}
],
"metrics": [
"ndcg@3",
"ndcg@5",
"map"
],
"net_name": "MVLSTM",
"model": {
"model_py": "mvlstm.MVLSTM",
"setting": {
"dropout_rate": 0.5,
"hidden_size": 50,
"topk": 100
},
"model_path": "./matchzoo/models/"
}
}
[./data/WikiQA/embed_glove_d50]
Embedding size: 18677
Traceback (most recent call last):
File "matchzoo/main.py", line 328, in
main(sys.argv)
File "matchzoo/main.py", line 320, in main
train(config)
File "matchzoo/main.py", line 67, in train
share_input_conf['embed'] = convert_embed_2_numpy(embed_dict, embed = embed)
File "/home/mldl/ub16_prj/MatchZoo/matchzoo/utils/rank_io.py", line 93, in convert_embed_2_numpy
embed[k] = np.array(embed_dict[k])
IndexError: index 18670 is out of bounds for axis 0 with size 18670
mldl@mldlUB1604:/ub16_prj/MatchZoo/examples/wikiqa$

Running bash run_dssm.sh in MatchZoo/examples/wikiqa failed

mldl@mldlUB1604:/ub16_prj/MatchZoo/examples/wikiqa$ bash run_dssm.sh
Using TensorFlow backend.
2017-12-14 03:34:23.080444: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:23.080467: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:23.080490: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:23.080496: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:23.080514: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:23.169856: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-12-14 03:34:23.170205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 950M
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.65GiB
2017-12-14 03:34:23.170236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-12-14 03:34:23.170242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-12-14 03:34:23.170271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0)
{
"inputs": {
"test": {
"phase": "EVAL",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_test.txt",
"dtype": "dssm"
},
"predict": {
"phase": "PREDICT",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_test.txt",
"dtype": "dssm"
},
"train": {
"relation_file": "./data/WikiQA/relation_train.txt",
"input_type": "Triletter_PairGenerator",
"batch_size": 100,
"batch_per_iter": 5,
"dtype": "dssm",
"phase": "TRAIN",
"query_per_iter": 50,
"use_iter": false
},
"share": {
"vocab_size": 3314,
"embed_size": 1,
"target_mode": "ranking",
"text1_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"text2_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"word_triletter_map_file": "./data/WikiQA/word_triletter_map.txt"
},
"valid": {
"phase": "EVAL",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_valid.txt",
"dtype": "dssm"
}
},
"global": {
"optimizer": "adam",
"num_iters": 400,
"save_weights_iters": 10,
"learning_rate": 0.0001,
"test_weights_iters": 400,
"weights_file": "examples/wikiqa/weights/dssm.wikiqa.weights",
"model_type": "PY",
"display_interval": 10
},
"outputs": {
"predict": {
"save_format": "TREC",
"save_path": "predict.test.wikiqa.txt"
}
},
"losses": [
{
"object_name": "rank_hinge_loss",
"object_params": {
"margin": 1.0
}
}
],
"metrics": [
"ndcg@3",
"ndcg@5",
"map"
],
"net_name": "DSSM",
"model": {
"model_py": "dssm.DSSM",
"setting": {
"dropout_rate": 0.9,
"hidden_sizes": [
300
]
},
"model_path": "./matchzoo/models/"
}
}
[Embedding] Embedding Load Done.
[Input] Process Input Tags. [u'train'] in TRAIN, [u'test', u'valid'] in EVAL.
[./data/WikiQA/corpus_preprocessed.txt]
Data size: 24106
[Dataset] 1 Dataset Load Done.
{u'relation_file': u'./data/WikiQA/relation_train.txt', u'vocab_size': 3314, u'embed_size': 1, u'target_mode': u'ranking', u'input_type': u'Triletter_PairGenerator', u'text1_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'batch_size': 100, u'batch_per_iter': 5, u'text2_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/WikiQA/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'TRAIN', 'embed': array([[-0.18291523],
[-0.00574826],
[-0.13887608],
...,
[-0.17844775],
[-0.1465386 ],
[-0.13503003]], dtype=float32), u'query_per_iter': 50, u'use_iter': False}
[./data/WikiQA/relation_train.txt]
Instance size: 20360
Pair Instance Count: 8995
[Triletter_PairGenerator] init done
{u'relation_file': u'./data/WikiQA/relation_test.txt', u'vocab_size': 3314, u'embed_size': 1, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'text2_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/WikiQA/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'EVAL', 'embed': array([[-0.18291523],
[-0.00574826],
[-0.13887608],
...,
[-0.17844775],
[-0.1465386 ],
[-0.13503003]], dtype=float32)}
[./data/WikiQA/relation_test.txt]
Instance size: 2341
List Instance Count: 237
[Triletter_ListGenerator] init done
{u'relation_file': u'./data/WikiQA/relation_valid.txt', u'vocab_size': 3314, u'embed_size': 1, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'text2_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/WikiQA/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'EVAL', 'embed': array([[-0.18291523],
[-0.00574826],
[-0.13887608],
...,
[-0.17844775],
[-0.1465386 ],
[-0.13503003]], dtype=float32)}
[./data/WikiQA/relation_valid.txt]
Instance size: 1126
List Instance Count: 122
[Triletter_ListGenerator] init done
[DSSM] init done
[layer]: Input [shape]: [None, 3314]
[Memory] Total Memory Use: 294.5273 MB Resident: 301596 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: Input [shape]: [None, 3314]
[Memory] Total Memory Use: 294.5273 MB Resident: 301596 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: MLP [shape]: [None, 300]
[Memory] Total Memory Use: 295.1914 MB Resident: 302276 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: MLP [shape]: [None, 300]
[Memory] Total Memory Use: 295.1914 MB Resident: 302276 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: Dot [shape]: [None, 1]
[Memory] Total Memory Use: 295.1914 MB Resident: 302276 Shared: 0 UnshareData: 0 UnshareStack: 0
[Model] Model Compile Done.
[12-14-2017 03:34:23] [Train:train] Traceback (most recent call last):
File "matchzoo/main.py", line 328, in
main(sys.argv)
File "matchzoo/main.py", line 320, in main
train(config)
File "matchzoo/main.py", line 151, in train
verbose = 0
File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
TypeError: fit_generator() got an unexpected keyword argument 'shuffle'
Using TensorFlow backend.
2017-12-14 03:34:25.341013: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:25.341035: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:25.341060: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:25.341064: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:25.341069: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-12-14 03:34:25.406950: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-12-14 03:34:25.407200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 950M
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.65GiB
2017-12-14 03:34:25.407216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-12-14 03:34:25.407220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-12-14 03:34:25.407230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0)
{
"inputs": {
"test": {
"phase": "EVAL",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_test.txt",
"dtype": "dssm"
},
"predict": {
"phase": "PREDICT",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_test.txt",
"dtype": "dssm"
},
"train": {
"relation_file": "./data/WikiQA/relation_train.txt",
"input_type": "Triletter_PairGenerator",
"batch_size": 100,
"batch_per_iter": 5,
"dtype": "dssm",
"phase": "TRAIN",
"query_per_iter": 50,
"use_iter": false
},
"share": {
"vocab_size": 3314,
"embed_size": 1,
"target_mode": "ranking",
"text1_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"text2_corpus": "./data/WikiQA/corpus_preprocessed.txt",
"word_triletter_map_file": "./data/WikiQA/word_triletter_map.txt"
},
"valid": {
"phase": "EVAL",
"input_type": "Triletter_ListGenerator",
"batch_list": 10,
"relation_file": "./data/WikiQA/relation_valid.txt",
"dtype": "dssm"
}
},
"global": {
"optimizer": "adam",
"num_iters": 400,
"save_weights_iters": 10,
"learning_rate": 0.0001,
"test_weights_iters": 400,
"weights_file": "examples/wikiqa/weights/dssm.wikiqa.weights",
"model_type": "PY",
"display_interval": 10
},
"outputs": {
"predict": {
"save_format": "TREC",
"save_path": "predict.test.wikiqa.txt"
}
},
"losses": [
{
"object_name": "rank_hinge_loss",
"object_params": {
"margin": 1.0
}
}
],
"metrics": [
"ndcg@3",
"ndcg@5",
"map"
],
"net_name": "DSSM",
"model": {
"model_py": "dssm.DSSM",
"setting": {
"dropout_rate": 0.9,
"hidden_sizes": [
300
]
},
"model_path": "./matchzoo/models/"
}
}
[Embedding] Embedding Load Done.
[Input] Process Input Tags. [u'predict'] in PREDICT.
[./data/WikiQA/corpus_preprocessed.txt]
Data size: 24106
[Dataset] 1 Dataset Load Done.
{u'relation_file': u'./data/WikiQA/relation_test.txt', u'vocab_size': 3314, u'embed_size': 1, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'text2_corpus': u'./data/WikiQA/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/WikiQA/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'PREDICT', 'embed': array([[-0.18291523],
[-0.00574826],
[-0.13887608],
...,
[-0.17844775],
[-0.1465386 ],
[-0.13503003]], dtype=float32)}
[./data/WikiQA/relation_test.txt]
Instance size: 2341
List Instance Count: 237
[Triletter_ListGenerator] init done
[DSSM] init done
[layer]: Input [shape]: [None, 3314]
[Memory] Total Memory Use: 289.7930 MB Resident: 296748 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: Input [shape]: [None, 3314]
[Memory] Total Memory Use: 289.7930 MB Resident: 296748 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: MLP [shape]: [None, 300]
[Memory] Total Memory Use: 290.1719 MB Resident: 297136 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: MLP [shape]: [None, 300]
[Memory] Total Memory Use: 290.1719 MB Resident: 297136 Shared: 0 UnshareData: 0 UnshareStack: 0
[layer]: Dot [shape]: [None, 1]
[Memory] Total Memory Use: 290.4727 MB Resident: 297444 Shared: 0 UnshareData: 0 UnshareStack: 0
Traceback (most recent call last):
File "matchzoo/main.py", line 328, in
main(sys.argv)
File "matchzoo/main.py", line 322, in main
predict(config)
File "matchzoo/main.py", line 245, in predict
model.load_weights(weights_file)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2566, in load_weights
f = h5py.File(filepath, mode='r')
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 269, in init
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 99, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to open file: name = 'examples/wikiqa/weights/dssm.wikiqa.weights.400', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
mldl@mldlUB1604:/ub16_prj/MatchZoo/examples/wikiqa$

DeepRank model

Do you plan to release code for your DeepRank model (CIKM'17) as part of MatchZoo?

Segmentation fault running DSSM on another dataset

python matchzoo/main.py --phase train --model_file examples/config/dssm_ranking.config 
Using TensorFlow backend.
2018-01-08 11:47:26.702599: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX
{
  "inputs": {
    "test": {
      "phase": "EVAL", 
      "input_type": "Triletter_ListGenerator", 
      "batch_list": 10, 
      "relation_file": "./data/relation_test.txt", 
      "dtype": "dssm"
    }, 
    "predict": {
      "phase": "PREDICT", 
      "input_type": "Triletter_ListGenerator", 
      "batch_list": 10, 
      "relation_file": "./data/relation_test.txt", 
      "dtype": "dssm"
    }, 
    "train": {
      "relation_file": "./data/relation_train.txt", 
      "input_type": "Triletter_PairGenerator", 
      "batch_size": 100, 
      "batch_per_iter": 5, 
      "dtype": "dssm", 
      "phase": "TRAIN", 
      "query_per_iter": 3, 
      "use_iter": true
    }, 
    "share": {
      "vocab_size": 3484, 
      "embed_size": 10, 
      "target_mode": "ranking", 
      "text1_corpus": "./data/corpus_preprocessed.txt", 
      "text2_corpus": "./data/corpus_preprocessed.txt", 
      "word_triletter_map_file": "./data/word_triletter_map.txt"
    }, 
    "valid": {
      "phase": "EVAL", 
      "input_type": "Triletter_ListGenerator", 
      "batch_list": 10, 
      "relation_file": "./data/relation_valid.txt", 
      "dtype": "dssm"
    }
  }, 
  "global": {
    "optimizer": "adam", 
    "num_iters": 10, 
    "save_weights_iters": 10, 
    "learning_rate": 0.0001, 
    "test_weights_iters": 10, 
    "weights_file": "examples/weights/dssm_ranking.weights", 
    "model_type": "PY", 
    "display_interval": 10
  }, 
  "outputs": {
    "predict": {
      "save_format": "TREC", 
      "save_path": "predict.test.dssm_ranking.txt"
    }
  }, 
  "losses": [
    {
      "object_name": "rank_hinge_loss", 
      "object_params": {
        "margin": 1.0
      }
    }
  ], 
  "metrics": [
    "ndcg@3", 
    "ndcg@5", 
    "map"
  ], 
  "net_name": "dssm", 
  "model": {
    "model_py": "dssm.DSSM", 
    "setting": {
      "dropout_rate": 0.5, 
      "hidden_sizes": [
        100, 
        30
      ]
    }, 
    "model_path": "matchzoo/models/"
  }
}
[Embedding] Embedding Load Done.
[Input] Process Input Tags. [u'train'] in TRAIN, [u'test', u'valid'] in EVAL.
[./data/corpus_preprocessed.txt]
        Data size: 71849
[Dataset] 1 Dataset Load Done.
{u'relation_file': u'./data/relation_train.txt', u'vocab_size': 3484, u'embed_size': 10, u'target_mode': u'ranking', u'input_type': u'Triletter_PairGenerator', u'text1_corpus': u'./data/corpus_preprocessed.txt', u'batch_size': 100, u'batch_per_iter': 5, u'text2_corpus': u'./data/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'TRAIN', 'embed': array([[-0.18291523, -0.00574826, -0.13887608, ..., -0.13666791,
         0.00907838,  0.13784599],
       [ 0.03368587,  0.13503729,  0.00107509, ...,  0.18584302,
         0.03414046, -0.14042418],
       [ 0.03610065,  0.19066425,  0.11800677, ...,  0.14983599,
        -0.09182639, -0.0633784 ],
       ..., 
       [ 0.1179866 , -0.19746014,  0.08622313, ..., -0.02868197,
        -0.07183626,  0.06968395],
       [-0.02044802,  0.17994043, -0.0810562 , ...,  0.03050527,
         0.03873055, -0.14228183],
       [ 0.04971068,  0.16548306,  0.08958763, ...,  0.0537957 ,
         0.04853643,  0.09921838]], dtype=float32), u'query_per_iter': 3, u'use_iter': True}
[./data/relation_train.txt]
        Instance size: 32953
[Triletter_PairGenerator] init done
{u'relation_file': u'./data/relation_test.txt', u'vocab_size': 3484, u'embed_size': 10, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/corpus_preprocessed.txt', u'text2_corpus': u'./data/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'EVAL', 'embed': array([[-0.18291523, -0.00574826, -0.13887608, ..., -0.13666791,
         0.00907838,  0.13784599],
       [ 0.03368587,  0.13503729,  0.00107509, ...,  0.18584302,
         0.03414046, -0.14042418],
       [ 0.03610065,  0.19066425,  0.11800677, ...,  0.14983599,
        -0.09182639, -0.0633784 ],
       ..., 
       [ 0.1179866 , -0.19746014,  0.08622313, ..., -0.02868197,
        -0.07183626,  0.06968395],
       [-0.02044802,  0.17994043, -0.0810562 , ...,  0.03050527,
         0.03873055, -0.14228183],
       [ 0.04971068,  0.16548306,  0.08958763, ...,  0.0537957 ,
         0.04853643,  0.09921838]], dtype=float32)}
[./data/relation_test.txt]
        Instance size: 25535
List Instance Count: 1445
[Triletter_ListGenerator] init done
{u'relation_file': u'./data/relation_valid.txt', u'vocab_size': 3484, u'embed_size': 10, u'target_mode': u'ranking', u'input_type': u'Triletter_ListGenerator', u'batch_list': 10, u'text1_corpus': u'./data/corpus_preprocessed.txt', u'text2_corpus': u'./data/corpus_preprocessed.txt', u'word_triletter_map_file': u'./data/word_triletter_map.txt', u'dtype': u'dssm', u'phase': u'EVAL', 'embed': array([[-0.18291523, -0.00574826, -0.13887608, ..., -0.13666791,
         0.00907838,  0.13784599],
       [ 0.03368587,  0.13503729,  0.00107509, ...,  0.18584302,
         0.03414046, -0.14042418],
       [ 0.03610065,  0.19066425,  0.11800677, ...,  0.14983599,
        -0.09182639, -0.0633784 ],
       ..., 
       [ 0.1179866 , -0.19746014,  0.08622313, ..., -0.02868197,
        -0.07183626,  0.06968395],
       [-0.02044802,  0.17994043, -0.0810562 , ...,  0.03050527,
         0.03873055, -0.14228183],
       [ 0.04971068,  0.16548306,  0.08958763, ...,  0.0537957 ,
         0.04853643,  0.09921838]], dtype=float32)}
[./data/relation_valid.txt]
        Instance size: 24919
List Instance Count: 1443
[Triletter_ListGenerator] init done
[DSSM] init done
[layer]: Input  [shape]: [None, 3484] 
 [Memory] Total Memory Use: 249.0977 MB          Resident: 261197824 Shared: 0 UnshareData: 0 UnshareStack: 0 
[layer]: Input  [shape]: [None, 3484] 
 [Memory] Total Memory Use: 249.1133 MB          Resident: 261214208 Shared: 0 UnshareData: 0 UnshareStack: 0 
[layer]: MLP    [shape]: [None, 30] 
 [Memory] Total Memory Use: 250.2773 MB          Resident: 262434816 Shared: 0 UnshareData: 0 UnshareStack: 0 
[layer]: MLP    [shape]: [None, 30] 
 [Memory] Total Memory Use: 250.5195 MB          Resident: 262688768 Shared: 0 UnshareData: 0 UnshareStack: 0 
[layer]: Dot    [shape]: [None, 1] 
 [Memory] Total Memory Use: 250.6992 MB          Resident: 262877184 Shared: 0 UnshareData: 0 UnshareStack: 0 
[Model] Model Compile Done.
Segmentation fault: 11

Missing data

Hi,

I have tried to set up the project as described, however the data is missing.
Looking at the model config files, I can see that they reference unavailable folders. If you try to set up the project in a vanilla environment, you will find something like this:

IOError: [Errno 2] No such file or directory: u'../data/mq2007/embed.idf'

`weights` directory not found

I followed the README.md, cloned the repo and ran python matchzoo/main.py --phase train --model_file examples/toy_example/config/arci_ranking.config and this error is shown:

Traceback (most recent call last):
  File "matchzoo/main.py", line 328, in <module>
    main(sys.argv)
  File "matchzoo/main.py", line 320, in main
    train(config)
  File "matchzoo/main.py", line 178, in train
    model.save_weights(weights_file % (i_e+1))
  File "/home/zeyi/.virtualenvs/match-zoo/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2586, in save_weights
    f = h5py.File(filepath, 'w')
  File "/home/zeyi/.virtualenvs/match-zoo/local/lib/python2.7/site-packages/h5py/_hl/files.py", line 269, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/zeyi/.virtualenvs/match-zoo/local/lib/python2.7/site-packages/h5py/_hl/files.py", line 105, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 98, in h5py.h5f.create
IOError: Unable to create file (unable to open file: name = 'examples/toy_example/weights/arci_ranking.weights.10', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)

I resolved this by mkdir examples/toy_example/weights. However, this issue needs to be addressed.

There are two possible solutions that come to my mind (see the sketch after this list):

  1. make the directory and touch a .keep file, which is the convention for keeping an empty directory in git
  2. check os.path.exists(path) before saving weights to that directory; if no such path exists, call os.mkdir(path)
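
A minimal sketch of option 2 in Python (the helper name is illustrative, not part of MatchZoo):

import os

def ensure_weights_dir(weights_file):
    # Create the parent directory (e.g. examples/toy_example/weights/) if it is
    # missing, so model.save_weights() can create the HDF5 file inside it.
    weights_dir = os.path.dirname(weights_file)
    if weights_dir and not os.path.exists(weights_dir):
        os.makedirs(weights_dir)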

GPU support

Hi there!

Any info on when (and if) we will have GPU support for the models?

Thanks!

DRMM_TKS

Is it possible to provide further details on DRMM_TKS (I mean, the paper it is based on)?

ImportError: No module named 'resource'

When I use "python matchzoo/main.py --phase train --model_file examples/toy_example/config/arci_ranking.config" to test, the following problem arises:

D:\tool\MatchZoo-master\matchzoo>python main.py --phase train --model_file ./models/arci_ranking.config
Using TensorFlow backend.
Traceback (most recent call last):
File "main.py", line 20, in <module>
File "D:\tool\MatchZoo-master\matchzoo\utils\__init__.py", line 9, in <module>
from .utility import import_class
File "D:\tool\MatchZoo-master\matchzoo\utils\utility.py", line 5, in <module>
import resource
ImportError: No module named 'resource'

What can I do?

Working examples?

Do you have any working examples that can help me learn more about the toolkit?

Missing example data

Hi, is it possible to provide the file of /data/example/ranking/word_triletter_map.txt?

How to run MatchZoo distributed

Hi, I want to run MatchZoo distributed with TensorFlow. Is this feasible? If feasible, can you say something about how to achieve it? Thank you.

I met errors when I run "python example/toy_example/test_preparation_for_classification.py"

[kkk@MatchZoo]$ python examples/toy_example/test_preparation_for_classify.py
Traceback (most recent call last):
File "examples/toy_example/test_preparation_for_classify.py", line 7, in
from preparation import *
ImportError: No module named preparation

Obviously, you said I should run "python examples/test_preparation_for_classification.py". This is a mistake, right? I wonder why the usage is different from the description in your readme...
And are you really sure that you can run the code on your server?

Error when training toy_example

An error occurs when training the examples.

Command executed:
python matchzoo/main.py --phase train --model_file examples/toy_example/config/arci_ranking.config

Error message:
Traceback (most recent call last):
File "matchzoo/main.py", line 328, in <module>
main(sys.argv)
File "matchzoo/main.py", line 320, in main
train(config)
File "matchzoo/main.py", line 178, in train
model.save_weights(weights_file % (i_e+1))
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 2586, in save_weights
f = h5py.File(filepath, 'w')
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/h5py/_hl/files.py", line 271, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/home/hadoop/anaconda2/lib/python2.7/site-packages/h5py/_hl/files.py", line 107, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 98, in h5py.h5f.create
IOError: Unable to create file (Unable to open file: name = 'examples/toy_example/weights/arci_ranking.weights.10', errno = 2, error message = 'no such file or directory', flags = 13, o_flags = 242)

Is the problem with this file?
examples/toy_example/weights/arci_ranking.weights.10

Questions about ndcg@ and map accuracy

Hello, please forgive me for asking in Chinese. There are a few questions I don't quite understand:

  1. The ndcg and map numbers obtained by the models on GitHub are far from the numbers reported in the paper (A Deep Relevance Matching Model for Ad-hoc Retrieval), and the same holds for the papers that proposed the other models. Is this caused by using different data? I'm not sure what is going on.
  2. Many of the model implementations are not built strictly following the models and hyper-parameters in the papers. Is it rigorous to present the papers' numbers in that case? For example, the ARC-I model in the paper uses a multi-layer perceptron after the convolution, while the arci model in the code directly attaches a softmax layer.
  3. What is the use case of classify? Can it be seen as a special case of ranking?
    Thanks in advance for your answers.

WeChat

Your WeChat MatchZoo group is full, and I can't join you. Could you please add me? My WeChat ID is: hshrimp. Thank you.

What should the results be like if I set the validation set and the prediction set the same?

I ran into a problem when I set the training, validation, and prediction datasets to be the same and ran the train and predict pipeline.

I used MatchPyramid as the matching model, with a 0.0001 learning rate and 400 epochs for training. It showed that during the last iterations the accuracy on the eval set had reached 0.99+. However, when running prediction, loading the model from the last iteration, and predicting on the training set, the accuracy was about 0.8.

Anyone knows what's wrong?

Fix error on Windows

The resource lib is Unix-specific (see the Python docs: 35.11. resource — Resource usage information).

This causes "No module named 'resource'" on the Windows platform. Modify matchzoo\utils\utility.py to work around the error:

import os
import sys
import traceback

WIN = False
try:
    import resource  # Unix-only module
except ImportError:
    WIN = True

def show_layer_info(layer_name, layer_out):
    print('[layer]: %s\t[shape]: %s \n%s' % (
        layer_name, str(layer_out.get_shape().as_list()), show_memory_use()))

def show_memory_use():
    if WIN:
        # resource is unavailable on Windows, so skip the memory report.
        return ""
    rusage_denom = 1024.
    if sys.platform == 'darwin':
        rusage_denom = rusage_denom * rusage_denom
    ru = resource.getrusage(resource.RUSAGE_SELF)
    total_memory = 1. * (ru.ru_maxrss + ru.ru_ixrss + ru.ru_idrss + ru.ru_isrss) / rusage_denom
    strinfo = "\x1b[33m [Memory] Total Memory Use: %.4f MB \t Resident: %ld Shared: %ld UnshareData: %ld UnshareStack: %ld \x1b[0m" % \
              (total_memory, ru.ru_maxrss, ru.ru_ixrss, ru.ru_idrss, ru.ru_isrss)
    return strinfo

What's the logic of the implementation of rank hinge loss?

Given a positive-labeled sample s1 and a negative-labeled sample s2, the neural model outputs two prediction values, which are then fed into the rank hinge loss. I got a little confused about the logic of computing this loss in Keras when diving into the MatchZoo code; how do you handle this? Thank you in advance.
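
For reference, the usual Keras formulation looks roughly like the sketch below. It assumes (as MatchZoo's pair generator arranges it, to the best of my reading) that each positive sample is immediately followed by its paired negative sample in the batch:

from keras import backend as K

def rank_hinge_loss(y_true, y_pred, margin=1.0):
    # Even rows hold scores for positive samples, odd rows their paired negatives.
    y_pos = y_pred[0::2]
    y_neg = y_pred[1::2]
    # Penalize whenever the negative score comes within `margin` of the positive one.
    return K.mean(K.maximum(0.0, margin + y_neg - y_pos))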

Use of Python 3

Good morning,
I would like to use this toolkit to run experiments and develop/edit deep models. My question is: can I run it with Python 3 (with the other requirements), or should I use Python 2.7?
Thanks

Small bug: toy_example/weights/ folder does not exist

When I tried to run the toy example, I got the error:
Unable to open file: name = 'examples/toy_example/weights/arci_ranking.weights.10', errno = 2, error message = 'no such file or directory'

The error was due to the weights folder not having been created. After manually adding the folder, the problem was solved.

failure of "python setup.py install" ???

[root@training2 MatchZoo]# rm /usr/lib/python2.7/site-packages/MatchZoo-0.2.0-py2.7.egg
rm: remove regular file ‘/usr/lib/python2.7/site-packages/MatchZoo-0.2.0-py2.7.egg’? y
[root@training2 MatchZoo]#
[root@training2 MatchZoo]#
[root@training2 MatchZoo]# python setup.py install
running install
running bdist_egg
running egg_info
writing requirements to MatchZoo.egg-info/requires.txt
writing MatchZoo.egg-info/PKG-INFO
writing top-level names to MatchZoo.egg-info/top_level.txt
writing dependency_links to MatchZoo.egg-info/dependency_links.txt
reading manifest file 'MatchZoo.egg-info/SOURCES.txt'
writing manifest file 'MatchZoo.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/matchzoo
creating build/bdist.linux-x86_64/egg/matchzoo/inputs
copying build/lib/matchzoo/inputs/__init__.py -> build/bdist.linux-x86_64/egg/matchzoo/inputs
copying build/lib/matchzoo/inputs/list_generator.py -> build/bdist.linux-x86_64/egg/matchzoo/inputs
copying build/lib/matchzoo/inputs/pair_generator.py -> build/bdist.linux-x86_64/egg/matchzoo/inputs
copying build/lib/matchzoo/inputs/point_generator.py -> build/bdist.linux-x86_64/egg/matchzoo/inputs
copying build/lib/matchzoo/inputs/preparation.py -> build/bdist.linux-x86_64/egg/matchzoo/inputs
copying build/lib/matchzoo/inputs/preprocess.py -> build/bdist.linux-x86_64/egg/matchzoo/inputs
creating build/bdist.linux-x86_64/egg/matchzoo/layers
copying build/lib/matchzoo/layers/DynamicMaxPooling.py -> build/bdist.linux-x86_64/egg/matchzoo/layers
copying build/lib/matchzoo/layers/Match.py -> build/bdist.linux-x86_64/egg/matchzoo/layers
copying build/lib/matchzoo/layers/MatchTensor.py -> build/bdist.linux-x86_64/egg/matchzoo/layers
copying build/lib/matchzoo/layers/NonMasking.py -> build/bdist.linux-x86_64/egg/matchzoo/layers
copying build/lib/matchzoo/layers/SparseFullyConnectedLayer.py -> build/bdist.linux-x86_64/egg/matchzoo/layers
copying build/lib/matchzoo/layers/__init__.py -> build/bdist.linux-x86_64/egg/matchzoo/layers
creating build/bdist.linux-x86_64/egg/matchzoo/losses
copying build/lib/matchzoo/losses/__init__.py -> build/bdist.linux-x86_64/egg/matchzoo/losses
copying build/lib/matchzoo/losses/rank_losses.py -> build/bdist.linux-x86_64/egg/matchzoo/losses
creating build/bdist.linux-x86_64/egg/matchzoo/metrics
copying build/lib/matchzoo/metrics/__init__.py -> build/bdist.linux-x86_64/egg/matchzoo/metrics
copying build/lib/matchzoo/metrics/evaluations.py -> build/bdist.linux-x86_64/egg/matchzoo/metrics
copying build/lib/matchzoo/metrics/rank_evaluations.py -> build/bdist.linux-x86_64/egg/matchzoo/metrics
creating build/bdist.linux-x86_64/egg/matchzoo/utils
copying build/lib/matchzoo/utils/__init__.py -> build/bdist.linux-x86_64/egg/matchzoo/utils
copying build/lib/matchzoo/utils/rank_io.py -> build/bdist.linux-x86_64/egg/matchzoo/utils
copying build/lib/matchzoo/utils/roc_auc.py -> build/bdist.linux-x86_64/egg/matchzoo/utils
copying build/lib/matchzoo/utils/utility.py -> build/bdist.linux-x86_64/egg/matchzoo/utils
copying build/lib/matchzoo/__init__.py -> build/bdist.linux-x86_64/egg/matchzoo
copying build/lib/matchzoo/main.py -> build/bdist.linux-x86_64/egg/matchzoo
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/inputs/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/inputs/list_generator.py to list_generator.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/inputs/pair_generator.py to pair_generator.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/inputs/point_generator.py to point_generator.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/inputs/preparation.py to preparation.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/inputs/preprocess.py to preprocess.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/layers/DynamicMaxPooling.py to DynamicMaxPooling.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/layers/Match.py to Match.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/layers/MatchTensor.py to MatchTensor.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/layers/NonMasking.py to NonMasking.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/layers/SparseFullyConnectedLayer.py to SparseFullyConnectedLayer.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/layers/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/losses/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/losses/rank_losses.py to rank_losses.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/metrics/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/metrics/evaluations.py to evaluations.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/metrics/rank_evaluations.py to rank_evaluations.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/utils/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/utils/rank_io.py to rank_io.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/utils/roc_auc.py to roc_auc.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/utils/utility.py to utility.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/matchzoo/main.py to main.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying MatchZoo.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MatchZoo.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MatchZoo.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MatchZoo.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MatchZoo.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating 'dist/MatchZoo-0.2.0-py2.7.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing MatchZoo-0.2.0-py2.7.egg
Copying MatchZoo-0.2.0-py2.7.egg to /usr/lib/python2.7/site-packages
Adding MatchZoo 0.2.0 to easy-install.pth file

Installed /usr/lib/python2.7/site-packages/MatchZoo-0.2.0-py2.7.egg
Traceback (most recent call last):
File "setup.py", line 38, in
'tqdm >= 4.19.4'
File "/usr/lib64/python2.7/distutils/core.py", line 152, in setup
dist.run_commands()
File "/usr/lib64/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/site-packages/setuptools/command/install.py", line 73, in run
self.do_egg_install()
File "/usr/lib/python2.7/site-packages/setuptools/command/install.py", line 101, in do_egg_install
cmd.run()
File "/usr/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 380, in run
self.easy_install(spec, not self.no_deps)
File "/usr/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 604, in easy_install
return self.install_item(None, spec, tmpdir, deps, True)
File "/usr/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 655, in install_item
self.process_distribution(spec, dist, deps)
File "/usr/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 701, in process_distribution
distreq.project_name, distreq.specs, requirement.extras
TypeError: __init__() takes exactly 2 arguments (4 given)

About the word embedding

Is the word embedding in WikiQA based on GloVe?
How do you train, or otherwise obtain, the word embedding?
Is the embedding also trained on the WikiQA dataset?
Thank you

Is "embed_path" in examples/wikiqa/config/drmm_wikiqa.config wrong?

Looking at the configuration file examples/wikiqa/config/drmm_wikiqa.config, I find this:

"embed_size": 300,
"embed_path": "./data/WikiQA/embed.idf",

this file "embed.idf" is generated from "cat word_stats.txt | cut -d ' ' -f 1,4 > embed.idf". And its content like this:
0 4.013749
1 4.035216
2 5.650094
3 8.964280

So is this a configuration bug for "embed_path"?

How to set the batch size for prediction?

Hi all, I think it is possible to set the training batch size to 100 and the prediction batch size to 10, right?
So I tried different prediction batch sizes (1, 10, 50, 100) and got different results after predicting.
This is binary classification using match_pyramid, predicting 42,155 samples in total:
size=1: numpy.core._internal.AxisError: axis 1 is out of bounds for array of dimension 1
size=10: predicts and outputs results for 42,142 samples
size=50: predicts and outputs results for 42,142 samples
size=100: predicts and outputs results for 42,092 samples
Does anyone know what went wrong?

The group is already full

The WeChat QR-code group already has more than 100 members; please add me: w16402151618.

An error for "steps_per_epoch = display_interval"

In the main() function there is a bug with the display_interval parameter: it seems to be set wrongly when train() calls fit_generator, because in Keras, steps_per_epoch does not mean a display interval.

for i_e in range(num_iters):
    for tag, generator in train_gen.items():
        genfun = generator.get_batch_generator()
        print('[%s]\t[Train:%s] ' % (time.strftime('%m-%d-%Y %H:%M:%S', time.localtime(time.time())), tag), end='')
        history = model.fit_generator(
            genfun,
            steps_per_epoch=display_interval,
            epochs=1,
            shuffle=False,
            verbose=0
        )  # callbacks=[eval_map]

Prediction question for the classify task

I ran the classify task, and there is a question: the test accuracy during training is as high as 80%, but the test accuracy during predict is only 60%. How can I deal with this?

Thanks!

Classification task error

I tried to run a MatchZoo classification task using the commands
sudo python main.py --phase train --model_file ./models/matchpyramid_classify.config
sudo python main.py --phase predict --model_file ./models/matchpyramid_classify.config
but I got an error:
Traceback (most recent call last):
File "main.py", line 304, in
main(sys.argv)
File "main.py", line 298, in main
predict(config)
File "main.py", line 248, in predict
list_counts = input_data['list_counts']
KeyError: 'list_counts'
"
would please tell me why this happend?

running "python setup.py install" failed

[ 8/11] Cythonizing /tmp/easy_install-gyR2M6/h5py-2.7.1/temp/easy_install-sqsKKi/Cython-0.27.3/Cython/Plex/Actions.py
[ 9/11] Cythonizing /tmp/easy_install-gyR2M6/h5py-2.7.1/temp/easy_install-sqsKKi/Cython-0.27.3/Cython/Plex/Scanners.py
[10/11] Cythonizing /tmp/easy_install-gyR2M6/h5py-2.7.1/temp/easy_install-sqsKKi/Cython-0.27.3/Cython/Runtime/refnanny.pyx
[11/11] Cythonizing /tmp/easy_install-gyR2M6/h5py-2.7.1/temp/easy_install-sqsKKi/Cython-0.27.3/Cython/Tempita/_tempita.py
warning: no files found matching '2to3-fixers.txt'
warning: no files found matching 'Doc/*'
warning: no files found matching '*.pyx' under directory 'Cython/Debugger/Tests'
warning: no files found matching '*.pxd' under directory 'Cython/Debugger/Tests'
warning: no files found matching '*.pxd' under directory 'Cython/Utility'
/tmp/easy_install-gyR2M6/h5py-2.7.1/temp/easy_install-sqsKKi/Cython-0.27.3/Cython/Plex/Scanners.c:19:20: fatal error: Python.h: No such file or directory
 #include "Python.h"
compilation terminated.
error: Setup script exited with error: command 'gcc' failed with exit status 1

[root@hadoop208 MatchZoo]# uname -a
Linux hadoop208 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@hadoop208 MatchZoo]# gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)

[root@hadoop208 MatchZoo]# cat /proc/version
Linux version 3.10.0-514.16.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Apr 12 15:04:24 UTC 2017

A few parameters I don't quite understand; can you explain? Thanks!

My relation_train.txt contains 2,000,000 rows of data pairs.
In XXX.config there are several parameters:
"num_iters": 22300,
"query_per_iter": 70,
"batch_per_iter": 5,
"batch_size": 100,

What do these mean? I'm not sure whether my understanding is correct:
batch_size means one batch runs 100 rows of data;
batch_per_iter means each iteration runs 5 batches;
num_iters means 22300 iterations are run in total.

So in total it would run 5 * 100 * 22300 = 11,150,000 rows of data pairs?

DeepRank model

When will the DeepRank model (proposed in CIKM'17) be released in MatchZoo?

ModuleNotFoundError : No module named 'jieba'

Hi, first, thank you for your dedication to this library.

I tried to run the script run_data.sh and encountered the following error asking me to install 'jieba'. It seems to be a Chinese segmentation library, but I don't have any plan to use it for Chinese texts. I wonder if I still have to install jieba, or is there any way to circumvent this issue. Thank you.

d3b122:WikiQA$ ./run_data.sh
--2018-01-10 16:17:59--  https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip
Resolving download.microsoft.com... 23.35.220.223, 2600:140b:4:285::e59, 2600:140b:4:284::e59, ...
Connecting to download.microsoft.com|23.35.220.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7094233 (6.8M) [application/octet-stream]
Saving to: `WikiQACorpus.zip'

100%[=====================================================================================================================================>] 7,094,233   14.7M/s   in 0.5s    

2018-01-10 16:18:00 (14.7 MB/s) - `WikiQACorpus.zip' saved [7094233/7094233]

Archive:  WikiQACorpus.zip
   creating: WikiQACorpus/emnlp-table/
  inflating: WikiQACorpus/emnlp-table/WikiQA.CNN.dev.rank  
  inflating: WikiQACorpus/emnlp-table/WikiQA.CNN.test.rank  
  inflating: WikiQACorpus/emnlp-table/WikiQA.CNN-Cnt.dev.rank  
  inflating: WikiQACorpus/emnlp-table/WikiQA.CNN-Cnt.test.rank  
  inflating: WikiQACorpus/eval.py    
  inflating: WikiQACorpus/Guidelines_Phase1.pdf  
  inflating: WikiQACorpus/Guidelines_Phase2.pdf  
  inflating: WikiQACorpus/WikiQA.tsv  
  inflating: WikiQACorpus/WikiQA-dev.ref  
  inflating: WikiQACorpus/WikiQA-dev.tsv  
  inflating: WikiQACorpus/WikiQA-dev.txt  
  inflating: WikiQACorpus/WikiQA-dev-filtered.ref  
  inflating: WikiQACorpus/WikiQASent.pos.ans.tsv  
  inflating: WikiQACorpus/WikiQA-test.ref  
  inflating: WikiQACorpus/WikiQA-test.tsv  
  inflating: WikiQACorpus/WikiQA-test.txt  
  inflating: WikiQACorpus/WikiQA-test-filtered.ref  
  inflating: WikiQACorpus/WikiQA-train.ref  
  inflating: WikiQACorpus/WikiQA-train.tsv  
  inflating: WikiQACorpus/WikiQA-train.txt  
  inflating: WikiQACorpus/LICENSE.pdf  
  inflating: WikiQACorpus/README.txt  
--2018-01-10 16:18:00--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu... 171.64.67.140
Connecting to nlp.stanford.edu|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2018-01-10 16:18:00--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip]
Saving to: `glove.840B.300d.zip'

100%[===================================================================================================================================>] 2,176,768,927  442K/s   in 80m 59s 

2018-01-10 17:39:00 (438 KB/s) - `glove.840B.300d.zip' saved [2176768927/2176768927]

Archive:  glove.840B.300d.zip
  inflating: glove.840B.300d.txt     
--2018-01-10 17:40:09--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu... 171.64.67.140
Connecting to nlp.stanford.edu|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2018-01-10 17:40:09--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: `glove.6B.zip'

100%[=====================================================================================================================================>] 862,182,613  389K/s   in 37m 25s 

2018-01-10 18:17:35 (375 KB/s) - `glove.6B.zip' saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
Traceback (most recent call last):
  File "prepare_mz_data.py", line 10, in <module>
    from preparation import Preparation
  File "../../matchzoo/inputs/preparation.py", line 11, in <module>
    import preprocess
  File "../../matchzoo/inputs/preprocess.py", line 4, in <module>
    import jieba
ModuleNotFoundError: No module named 'jieba'
load word dict ...
Traceback (most recent call last):
  File "gen_w2v.py", line 126, in <module>
    word_dict = load_word_dict(word_dict_file)
  File "gen_w2v.py", line 107, in load_word_dict
    for line in tqdm(io.open(word_map_file, encoding='utf8')):
FileNotFoundError: [Errno 2] No such file or directory: 'word_dict.txt'
Traceback (most recent call last):
  File "norm_embed.py", line 14, in <module>
    with codecs.open(infile, 'r', encoding='utf8') as f:
  File "/home1/irteam/anaconda3/lib/python3.6/codecs.py", line 895, in open
    file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'embed_glove_d300'
load word dict ...
Traceback (most recent call last):
  File "gen_w2v.py", line 126, in <module>
    word_dict = load_word_dict(word_dict_file)
  File "gen_w2v.py", line 107, in load_word_dict
    for line in tqdm(io.open(word_map_file, encoding='utf8')):
FileNotFoundError: [Errno 2] No such file or directory: 'word_dict.txt'
Traceback (most recent call last):
  File "norm_embed.py", line 14, in <module>
    with codecs.open(infile, 'r', encoding='utf8') as f:
  File "/home1/irteam/anaconda3/lib/python3.6/codecs.py", line 895, in open
    file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: 'embed_glove_d50'
cat: word_stats.txt: No such file or directory
Traceback (most recent call last):
  File "gen_hist4drmm.py", line 8, in <module>
    from preprocess import cal_hist
  File "../../matchzoo/inputs/preprocess.py", line 4, in <module>
    import jieba
ModuleNotFoundError: No module named 'jieba'
Traceback (most recent call last):
  File "gen_binsum4anmm.py", line 12, in <module>
    from preprocess import cal_binsum
  File "../../matchzoo/inputs/preprocess.py", line 4, in <module>
    import jieba
ModuleNotFoundError: No module named 'jieba'
Done ...

Installation fails with an error

Many thanks to the development team for their efforts. I am currently running into two problems:
1. When something goes wrong there is no way to communicate. Could you create a QQ group or something similar? I cannot get added on WeChat.
2. An error is reported during installation:

Scanners.c
Creating library build\temp.win32-2.7\Release\users\sony\appdata\local\temp\easy_install-ctqurr\h5py-2.7.1\temp\easy_install-mfrl5r\Cython-0.27.3\Cython\Plex\Scanners.lib and object build\temp.win32-2.7\Release\users\sony\appdata\local\temp\easy_install-ctqurr\h5py-2.7.1\temp\easy_install-mfrl5r\Cython-0.27.3\Cython\Plex\Scanners.exp
LINK : fatal error LNK1104: cannot open file "build\temp.win32-2.7\Release\users\sony\appdata\local\temp\easy_install-ctqurr\h5py-2.7.1\temp\easy_install-mfrl5r\Cython-0.27.3\Cython\Plex\Scanners.pyd.manifest"
error: Setup script exited with error: command 'F:\Program Files\Microsoft Visual Studio 9.0\VC\BIN\link.exe' failed with exit status 1104

SparseFullyConnectedLayer Missing

Hi, I'm trying to set up DSSM, but in dssm.py SparseFullyConnectedLayer is used and the definition is missing. Can you upload the relevant files? Thanks!

How to run MatchZoo on Windows

I tried to run setup.py with Python, but it displays an error. This is the error message:

SystemExit: usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: setup.py --help [cmd1 cmd2 ...]
or: setup.py --help-commands
or: setup.py cmd --help

error: no commands supplied

C:\Users\MCLAB\AppData\Local\conda\conda\envs\Tensorflow\lib\site-packages\IPython\core\interactiveshell.py:2870: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

My Python version is 3.5.4, and I have already installed the TensorFlow library.
