polyai-ldn / conversational-datasets
Large datasets for conversational AI
License: Apache License 2.0
Indeed, Google Cloud is not available to developers in China, so I can't run the script with the default storage setup.
Could you provide another way to download the data?
Thank you very much!
Hi, I was just wondering what the fix is for this issue. For the reddit dataset, I have followed all the steps up to before executing:
python tools/tfrutil.py pp ${DATADIR?}/train-00999-of-01000.tfrecords
But when I do, I get this error:
2019-05-24 10:20:36.304120: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.753181 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:37.063359: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.220163 seconds (attempt 2 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:37.288225: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.174629 seconds (attempt 3 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:37.466637: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.375593 seconds (attempt 4 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:37.847847: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.587914 seconds (attempt 5 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:38.440436: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.06559 seconds (attempt 6 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:39.512649: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.777596 seconds (attempt 7 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:40.294343: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.71192 seconds (attempt 8 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:42.010957: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.02302 seconds (attempt 9 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:43.041673: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.138 seconds (attempt 10 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:44.186215: W tensorflow/core/platform/cloud/google_auth_provider.cc:157] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
Traceback (most recent call last):
File "tools/tfrutil.py", line 118, in <module>
_cli()
File "/Library/Python/2.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/Library/Python/2.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/Library/Python/2.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Library/Python/2.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Library/Python/2.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "tools/tfrutil.py", line 46, in _pretty_print
for i, record in enumerate(tf.python_io.tf_record_iterator(path)):
File "/Library/Python/2.7/site-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
reader.GetNext()
File "/Library/Python/2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request: HTTP response code 401 with body '{
"error": {
"errors": [
{
"domain": "global",
"reason": "required",
"message": "Anonymous caller does not have storage.objects.get access to reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords.",
"locationType": "header",
"location": "Authorization"
}
],
"code": 401,
"message": "Anonymous caller does not have storage.objects.get access to reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords."
}
}
'
when reading metadata of gs://reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords
I suppose this is due to it not being able to access my credentials, so I followed the instructions here:
https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances
and downloaded a <project>-<code>.json file containing:
{
"type": "service_account",
"project_id": "xxxx",
"private_key_id": "xxxxxxxxx",
"private_key": "-----BEGIN PRIVATE KEY-----\n
xxxxxxx
\n-----END PRIVATE KEY-----\n",
"client_email": "[email protected]",
"client_id": "xxxxxxx",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "xxxxxxxxxxxxxx",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/xxxxxxx"
}
The error still persists. I would really appreciate any advice.
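In case it helps, TensorFlow's GCS reader looks for a key file via the GOOGLE_APPLICATION_CREDENTIALS environment variable, so one thing worth trying is pointing that variable at the downloaded service-account key (the path below is a placeholder for wherever you saved the JSON file):

```shell
# Placeholder path; substitute wherever you saved <project>-<code>.json.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-project-key.json"
echo "Using credentials: $GOOGLE_APPLICATION_CREDENTIALS"
```

With the variable exported in the same shell session, re-running tfrutil.py should pick up the credentials instead of falling back to the GCE metadata server.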
Thanks for sharing this code, and I am glad I have unused google cloud credits :)
Just found one tiny error while following the instructions of creating opensubtitles dataset
--sentence_files gs://${BUCKET?}/opensubtitles/raw/lines-* \
====>
--sentence_files gs://${BUCKET?}/opensubtitles/raw/lines/lines-* \
Hello!
I get the following error when trying to execute the create_data.py script in the Google Cloud Shell:
Traceback (most recent call last):
File "reddit/create_data.py", line 347, in <module>
run()
File "reddit/create_data.py", line 285, in run
p = beam.Pipeline(options=pipeline_options)
File "/home/user/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 203, in __init__
'Pipeline has validations errors: \n' + '\n'.join(errors))
ValueError: Pipeline has validations errors:
Missing required option: region.
I'm using the latest version of apache-beam, 2.23.0.
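For what it's worth, newer Apache Beam releases require the Dataflow region to be set explicitly, so adding a --region flag to the existing invocation (us-central1 below is only an example region) should clear this validation error:

```shell
# Same invocation as before, with an explicit Dataflow region added
# (us-central1 is only an example; use whichever region hosts your data).
python reddit/create_data.py \
  --output_dir ${DATADIR?} \
  --runner DataflowRunner \
  --project ${PROJECT?} \
  --region us-central1
```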
This repo processes datasets using GCP services. May I know if there is any tutorial to use your scripts to process raw data on local machines?
Thanks,
Peixiang
Hey, I am trying to install the requirements for this codebase via:
pip install -r requirements.txt
but am getting this error:
ERROR: No matching distribution found for apache-beam==2.5.0 (from -r conversational-datasets/requirements.txt (line 2))
I tried to fix this by installing apache-beam==2.5.0 via pip, but pip complains that it cannot find that distribution. After investigation, it looks like the latest version on PyPI is 2.2.0:
https://pypi.org/project/apache-beam/
Any suggestions on how to proceed? Thank you!
Hi Matthew, a quick question. I'm currently using the free trial of Google Cloud, which does not allow me to change the quota settings. So far I can only use 7 workers, and the data processing is very slow. Is it possible for me to access the raw data from another storage/computational platform, or maybe download it to my local CPU cluster to do the data processing?
NameError: name 'Comment' is not defined [while running 'Normalise comments-ptransform-58']
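A common cause of a NameError for a module-level name on Dataflow workers is that the main session isn't pickled and shipped to them. Assuming the script forwards unknown flags to Beam's pipeline options (as the other runner flags suggest), appending Beam's standard --save_main_session flag is one likely fix:

```shell
# Beam's standard flag to pickle the main session so module-level
# names (like Comment) resolve on the remote Dataflow workers.
python reddit/create_data.py \
  --runner DataflowRunner \
  --save_main_session
```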
I don't want to run this code on Google Cloud; I just want to run it up to the "Extract the data and split it into shards" step, but I don't know how to do it. Can someone explain how to run these commands?
This one:
PROJECT="your-google-cloud-project"
DATADIR="gs://${BUCKET?}/opensubtitles/$(date +"%Y%m%d")"
python opensubtitles/create_data.py \
  --output_dir ${DATADIR?} \
  --sentence_files gs://${BUCKET?}/opensubtitles/raw/lines/lines-* \
  --runner DataflowRunner \
  --temp_location ${DATADIR?}/temp \
  --staging_location ${DATADIR?}/staging \
  --project ${PROJECT?} \
  --dataset_format TF
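If the goal is just to produce the shards locally, one option (a sketch, assuming the OpenSubtitles lines files have already been downloaded to a local directory) is to swap DataflowRunner for Beam's DirectRunner and replace the gs:// URLs with local paths:

```shell
# Local run sketch: DirectRunner executes the Beam pipeline on this
# machine; no Google Cloud project or bucket is needed.
DATADIR="opensubtitles/$(date +"%Y%m%d")"
python opensubtitles/create_data.py \
  --output_dir ${DATADIR?} \
  --sentence_files "path/to/local/lines/lines-*" \
  --runner DirectRunner \
  --temp_location ${DATADIR?}/temp \
  --staging_location ${DATADIR?}/staging \
  --dataset_format TF
```

Note that DirectRunner is single-machine, so processing the full dataset this way will be much slower than on Dataflow.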
What if I want to test the model on a single example (context)? what will be the candidate responses to calculate relevance scores with the context? Are there any techniques to retrieve a number of response candidates from the large corpus? It would be helpful if you share some details regarding this. Thanks
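One common approach is to pre-encode a pool of candidate responses (e.g. sampled from the training set) and rank them by similarity to the encoded context. The sketch below uses a toy bag-of-words encoder standing in for a trained dual encoder; the `encode` and `score` helpers are illustrative, not part of this repo:

```python
from collections import Counter
import math

def encode(text):
    # Toy bag-of-words "encoding"; in practice you would use the
    # trained dual encoder's context/response towers instead.
    return Counter(text.lower().split())

def score(context_vec, response_vec):
    # Cosine similarity between the two sparse count vectors.
    dot = sum(context_vec[w] * response_vec[w] for w in context_vec)
    norm = (math.sqrt(sum(v * v for v in context_vec.values()))
            * math.sqrt(sum(v * v for v in response_vec.values())))
    return dot / norm if norm else 0.0

# Candidate pool: e.g. responses sampled from the corpus.
candidates = [
    "you can reset your password in settings",
    "the weather is nice today",
]
context = "how do I reset my password"
ranked = sorted(candidates,
                key=lambda r: score(encode(context), encode(r)),
                reverse=True)
```

At scale the candidate encodings would be precomputed once and searched with an approximate nearest-neighbour index rather than scored one by one.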
This app is blocked
This app tried to access sensitive info in your Google Account. To keep your account safe, Google blocked this access.
INFO:apache_beam.runners.dataflow.dataflow_runner:Could not estimate size of source <apache_beam.io.gcp.bigquery._CustomBigQuerySource object at 0x7faa91a5c0d0> due to an exception: Traceback (most recent call last):
File "/home/alvynabranches/redit_data/venv/lib/python3.9/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1186, in run_Read
transform.source.estimate_size())
File "/home/alvynabranches/redit_data/venv/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery.py", line 764, in estimate_size
bq = bigquery_tools.BigQueryWrapper()
File "/home/alvynabranches/redit_data/venv/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 336, in __init__
self.gcp_bq_client = client or gcp_bigquery.Client(
AttributeError: 'NoneType' object has no attribute 'Client'
It used to work but is not available now.
Do you think it makes sense to remove samples containing profanity?
Large datasets
Hi, I got the same error and solved it by updating httplib2 to the latest version. Regarding the requirements, I also updated to tensorflow==1.15.0, since version 1.14.0 gives me the following error: "No module named deprecation_wrapper".
Hmmm, did you change anything in the requirements.txt file other than updating httplib2 to the newest version and tensorflow to 1.15.0? I did both of those things but now am getting a "No module named module_wrapper" error :(
Originally posted by @amorisot in #70 (comment)
I get this error
TypeError: Unicode-objects must be encoded before hashing [while running 'split train and test/ParDo(_TrainTestSplitFn)']
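For reference, this error is typically Python 3's hashlib rejecting str input. A sketch of the likely fix (the helper below is illustrative, not the repo's actual _TrainTestSplitFn) is to encode the key to bytes before hashing:

```python
import hashlib

def split_bucket(key, train_percent=90):
    # hashlib in Python 3 requires bytes; passing a str raises
    # "Unicode-objects must be encoded before hashing".
    if isinstance(key, str):
        key = key.encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "train" if bucket < train_percent else "test"
```

Because the bucket is derived from a hash of the key, the split stays deterministic: the same thread id always lands in the same set.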
I want to download the Reddit dataset on my local machine in JSON format. How can I do it?
I (as well as numerous other users) would prefer our usernames and messages are not included in large datasets. Where can we submit a Request To Be Forgotten?
Please include support for python3, thanks :)
As the title says, using only comments leads to some information loss because the post is not used.
Any reasons?
Hi Matthew, I met this problem when I was running:
python reddit/create_data.py --output_dir ${DATADIR?} --reddit_table ${PROJECT?}:${DATASET?}.${TABLE?} --runner DataflowRunner --temp_location ${DATADIR?}/temp --staging_location ${DATADIR?}/staging --project ${PROJECT?} --dataset_format JSON --noauth_local_webserver
I1026 01:39:37.442483 139859749066496 dataflow_runner.py:177] 2020-10-26T01:39:37.193Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 766, in run
self._load_main_session(self.local_staging_directory)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 482, in _load_main_session
pickler.load_session(session_file)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 254, in load_session
return dill.load_session(file_path)
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 363, in load_session
module = unpickler.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
klass = self.find_class(module, name)
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
return StockUnpickler.find_class(self, module, name)
File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
__import__(module)
ImportError: No module named deprecation_wrapper
On using pip install apache-beam==2.5.0
I'm getting:
ERROR: tensorflow 1.12.0 has requirement tensorboard<1.13.0,>=1.12.0, but you'll have tensorboard 1.13.1 which is incompatible.
I would assume the error comes from
conversational-datasets/requirements.txt, line 73 (commit 6282739).
Will push a PR if we can confirm the issue.
Was wondering if it would be possible to release weights of the Bert models trained on reddit?
NameError: name 'Comment' is not defined
Was wondering if you have any ideas on parsing these TFRecord files in PyTorch?
In case this helps anyone: my dataflow job on Google Cloud was restricted to 1 worker, so I added the following argument when launching create_data.py:
--num_workers 8
Now the job was restricted to 4 workers instead, due to the quota on external IP addresses. So I turned them off with:
--no_use_public_ips
and enabled Private Google Access for the used region's subnet (us-central1 for me) in the VPC networks section.
Hi,
I have downloaded the Amazon data (38 files) and ran create_data.py with:
python amazon_qa/create_data.py --file_pattern AmazonQA/* --output_dir AmazonQA/processed/ --runner DirectRunner --temp_location AmazonQA/processed/temp --staging_location AmazonQA/processed/staging --dataset_format JSON
It results in 100 train*.json and 100 test*.json under the AmazonQA/processed/ folder. After I read all the data, the training set has 158974 samples and the test set has 16763.
What is the number of samples you used in the paper? 3M or 158.9K? I am confused because it is different from the number listed in the repo.
P.S. I saw some filtering functions have been done in the create_data.py file.
Below are some statistics of the conversational dataset:
Input files: 38
Number of QA dictionaries: 1,569,513
Number of tuples: 4,035,625
Number of de-duplicated tuples: 3,689,912
Train set size: 3,316,905
Test set size: 373,007
Thank you in advance for your kind reply.
I finished downloading the Reddit data, and I can view it quite nicely using the provided tfrutil.py module. The problem I'm encountering is that a lot of the comments seem to be cut off abruptly. For example:
Example 3
[Context]:
Quite a bit according to reddits own gold statistics.
[Response]:
I fucking clicked that without even checking the URL first. Goddamn it. You win the day fine sir.
Extra Contexts:
[context/0]:
I would like to know how much was spent on Reddit gold for people posting a Rick Roll. I bet it's way more than $12 (though to
Example 6
[Context]:
Rather you're rewarded for planning ahead because you don't have a crutch "save me" button.
[Response]:
Not like DBM will tell you when things are coming..
Extra Contexts:
[context/7]:
Warriors on suicide watch \n \n (I don't mind the gcd change on hunters actually. And the two ranged specs seem decent. Survival
[context/6]:
The Disengage on GCD is worse than any of the warriors spells.\n \n People made up stupid builds to show how you could go at
[context/5]:
I dunno, I got used to the disengage change pretty quick. It was annoying at first, mostly because you feel forced to stop
[context/4]:
Defensive abilities should be reactive and putting them locked because you're doing something else hurts the gameplay.\n \n
[context/3]:
To be fair Disengage isn't a purely defensive spell. I use disengage offensively all the time as a survival hunter, even as MM
[context/2]:
But you also use it to dodge problematic stuff while you're DPSing.
[context/1]:
Which is why it has reduced gcd
[context/0]:
But you're still penalized for playing to the best of your class.
As you can see, the comments seem to just be cut off, like "(though to" and ". Survival", etc. Since I'm using this data for what it's intended for, conversations, this could be pretty detrimental to the performance of the model, because the continuation doesn't make sense. Any thoughts on this?
Thanks
edit: These examples are pulled from test-00001-of-00100.tfrecords, if it helps. I don't know if your train-test splitting is random for everyone.
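In the meantime, a crude workaround (a heuristic sketch, not something the repo provides) is to filter out turns that look cut off, e.g. ones that don't end in sentence-final punctuation:

```python
def looks_truncated(text):
    # Heuristic: a turn that doesn't end with sentence-final
    # punctuation was probably cut off by a per-turn character cap.
    # This will produce some false positives (e.g. turns ending in
    # emoji or bare URLs), so treat it as a rough filter only.
    return not text.rstrip().endswith((".", "!", "?", '"', "'", ")"))
```

Applied to the examples above, "(though to" would be flagged while "You win the day fine sir." would pass.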