polyai-ldn / conversational-datasets
Large datasets for conversational AI
License: Apache License 2.0
Indeed, Google Cloud is not available to developers in China, so I can't run the script with the default storage setup.
Could you provide another way to download the data?
Thank you very much!
Hi, I was just wondering what the fix is for this issue. For the reddit dataset, I have followed all the steps up to before executing:
python tools/tfrutil.py pp ${DATADIR?}/train-00999-of-01000.tfrecords
But when I do, I get this error:
2019-05-24 10:20:36.304120: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.753181 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:37.063359: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.220163 seconds (attempt 2 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:37.288225: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.174629 seconds (attempt 3 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:37.466637: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.375593 seconds (attempt 4 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:37.847847: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.587914 seconds (attempt 5 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:38.440436: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.06559 seconds (attempt 6 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:39.512649: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.777596 seconds (attempt 7 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:40.294343: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.71192 seconds (attempt 8 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:42.010957: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.02302 seconds (attempt 9 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:43.041673: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.138 seconds (attempt 10 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-24 10:20:44.186215: W tensorflow/core/platform/cloud/google_auth_provider.cc:157] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
Traceback (most recent call last):
File "tools/tfrutil.py", line 118, in <module>
_cli()
File "/Library/Python/2.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/Library/Python/2.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/Library/Python/2.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Library/Python/2.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Library/Python/2.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "tools/tfrutil.py", line 46, in _pretty_print
for i, record in enumerate(tf.python_io.tf_record_iterator(path)):
File "/Library/Python/2.7/site-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
reader.GetNext()
File "/Library/Python/2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request: HTTP response code 401 with body '{
"error": {
"errors": [
{
"domain": "global",
"reason": "required",
"message": "Anonymous caller does not have storage.objects.get access to reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords.",
"locationType": "header",
"location": "Authorization"
}
],
"code": 401,
"message": "Anonymous caller does not have storage.objects.get access to reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords."
}
}
'
when reading metadata of gs://reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords
I suppose this is due to it not being able to access my credentials, so I followed the instructions here:
https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances
and downloaded a <project>-<code>.json file containing:
{
"type": "service_account",
"project_id": "xxxx",
"private_key_id": "xxxxxxxxx",
"private_key": "-----BEGIN PRIVATE KEY-----\n
xxxxxxx
\n-----END PRIVATE KEY-----\n",
"client_email": "[email protected]",
"client_id": "xxxxxxx",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "xxxxxxxxxxxxxx",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/xxxxxxx"
}
The error still persists. I would really appreciate any advice.
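In case it helps, TensorFlow's GCS reader looks for a key file via the GOOGLE_APPLICATION_CREDENTIALS environment variable, so one thing worth trying is pointing that variable at the downloaded service-account key (the path below is a placeholder for wherever you saved the JSON file):

```shell
# Placeholder path; substitute wherever you saved <project>-<code>.json.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-project-key.json"
echo "Using credentials: $GOOGLE_APPLICATION_CREDENTIALS"
```

With the variable exported in the same shell session, re-running tfrutil.py should pick up the credentials instead of falling back to the GCE metadata server.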
Thanks for sharing this code, and I am glad I have unused google cloud credits :)
Just found one tiny error while following the instructions of creating opensubtitles dataset
--sentence_files gs://${BUCKET?}/opensubtitles/raw/lines-* \
====>
--sentence_files gs://${BUCKET?}/opensubtitles/raw/lines/lines-* \
Hello!
I get the following error when trying to execute the create_data.py script in the Google Cloud Shell:
Traceback (most recent call last):
File "reddit/create_data.py", line 347, in <module>
run()
File "reddit/create_data.py", line 285, in run
p = beam.Pipeline(options=pipeline_options)
File "/home/user/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 203, in __init__
'Pipeline has validations errors: \n' + '\n'.join(errors))
ValueError: Pipeline has validations errors:
Missing required option: region.
I'm using the latest version of apache-beam, 2.23.0.
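For what it's worth, newer Apache Beam releases require the Dataflow region to be set explicitly, so adding a --region flag to the existing invocation (us-central1 below is only an example region) should clear this validation error:

```shell
# Same invocation as before, with an explicit Dataflow region added
# (us-central1 is only an example; use whichever region hosts your data).
python reddit/create_data.py \
  --output_dir ${DATADIR?} \
  --runner DataflowRunner \
  --project ${PROJECT?} \
  --region us-central1
```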
This repo processes datasets using GCP services. May I know if there is any tutorial to use your scripts to process raw data on local machines?
Thanks,
Peixiang
Hey, I am trying to install the requirements for this codebase via:
pip install -r requirements.txt
but am getting this error:
ERROR: No matching distribution found for apache-beam==2.5.0 (from -r conversational-datasets/requirements.txt (line 2))
I tried to fix this by installing apache-beam==2.5.0 via pip, but pip complains that it cannot find that distribution. After investigation, it looks like the latest version on PyPI is 2.2.0:
https://pypi.org/project/apache-beam/
Any suggestions on how to proceed? Thank you!
Hi Matthew, a quick question. I'm currently using the free trial of Google Cloud, which does not allow me to change the quota settings. So far I can only use 7 workers, and the data processing is very slow. Is it possible for me to access the raw data from another storage/computational platform, or maybe download it to my local CPU cluster to do the data processing?
NameError: name 'Comment' is not defined [while running 'Normalise comments-ptransform-58']
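A common cause of a NameError for a module-level name on Dataflow workers is that the main session isn't pickled and shipped to them. Assuming the script forwards unknown flags to Beam's pipeline options (as the other runner flags suggest), appending Beam's standard --save_main_session flag is one likely fix:

```shell
# Beam's standard flag to pickle the main session so module-level
# names (like Comment) resolve on the remote Dataflow workers.
python reddit/create_data.py \
  --runner DataflowRunner \
  --save_main_session
```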
I don't want to run this code on Google Cloud; I just want to run it up to the "Extract the data and split it into shards" step, but I don't know how to do it. Can someone explain how to run these commands?
This one:
PROJECT="your-google-cloud-project"
DATADIR="gs://${BUCKET?}/opensubtitles/$(date +"%Y%m%d")"
python opensubtitles/create_data.py \
  --output_dir ${DATADIR?} \
  --sentence_files gs://${BUCKET?}/opensubtitles/raw/lines/lines-* \
  --runner DataflowRunner \
  --temp_location ${DATADIR?}/temp \
  --staging_location ${DATADIR?}/staging \
  --project ${PROJECT?} \
  --dataset_format TF
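If the goal is just to produce the shards locally, one option (a sketch, assuming the OpenSubtitles lines files have already been downloaded to a local directory) is to swap DataflowRunner for Beam's DirectRunner and replace the gs:// URLs with local paths:

```shell
# Local run sketch: DirectRunner executes the Beam pipeline on this
# machine; no Google Cloud project or bucket is needed.
DATADIR="opensubtitles/$(date +"%Y%m%d")"
python opensubtitles/create_data.py \
  --output_dir ${DATADIR?} \
  --sentence_files "path/to/local/lines/lines-*" \
  --runner DirectRunner \
  --temp_location ${DATADIR?}/temp \
  --staging_location ${DATADIR?}/staging \
  --dataset_format TF
```

Note that DirectRunner is single-machine, so processing the full dataset this way will be much slower than on Dataflow.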
What if I want to test the model on a single example (context)? what will be the candidate responses to calculate relevance scores with the context? Are there any techniques to retrieve a number of response candidates from the large corpus? It would be helpful if you share some details regarding this. Thanks
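One common approach is to pre-encode a pool of candidate responses (e.g. sampled from the training set) and rank them by similarity to the encoded context. The sketch below uses a toy bag-of-words encoder standing in for a trained dual encoder; the `encode` and `score` helpers are illustrative, not part of this repo:

```python
from collections import Counter
import math

def encode(text):
    # Toy bag-of-words "encoding"; in practice you would use the
    # trained dual encoder's context/response towers instead.
    return Counter(text.lower().split())

def score(context_vec, response_vec):
    # Cosine similarity between the two sparse count vectors.
    dot = sum(context_vec[w] * response_vec[w] for w in context_vec)
    norm = (math.sqrt(sum(v * v for v in context_vec.values()))
            * math.sqrt(sum(v * v for v in response_vec.values())))
    return dot / norm if norm else 0.0

# Candidate pool: e.g. responses sampled from the corpus.
candidates = [
    "you can reset your password in settings",
    "the weather is nice today",
]
context = "how do I reset my password"
ranked = sorted(candidates,
                key=lambda r: score(encode(context), encode(r)),
                reverse=True)
```

At scale the candidate encodings would be precomputed once and searched with an approximate nearest-neighbour index rather than scored one by one.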
This app is blocked
This app tried to access sensitive info in your Google Account. To keep your account safe, Google blocked this access.
INFO:apache_beam.runners.dataflow.dataflow_runner:Could not estimate size of source <apache_beam.io.gcp.bigquery._CustomBigQuerySource object at 0x7faa91a5c0d0> due to an exception: Traceback (most recent call last):
File "/home/alvynabranches/redit_data/venv/lib/python3.9/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1186, in run_Read
transform.source.estimate_size())
File "/home/alvynabranches/redit_data/venv/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery.py", line 764, in estimate_size
bq = bigquery_tools.BigQueryWrapper()
File "/home/alvynabranches/redit_data/venv/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 336, in __init__
self.gcp_bq_client = client or gcp_bigquery.Client(
AttributeError: 'NoneType' object has no attribute 'Client'
It used to work but is not available now.
Do you think it makes sense to remove samples containing profanity?
Large datasets
Hi, I got the same error and solved it by updating httplib2 to the latest version. Regarding the requirements, I also updated to tensorflow==1.15.0, since version 1.14.0 gives me the following error: "No module named deprecation_wrapper".
Hmmm, did you change anything in the requirements.txt file other than updating httplib2 to the newest version and tensorflow to 1.15.0? I did both of those things but now am getting a "No module named module_wrapper" error :(
Originally posted by @amorisot in #70 (comment)
I get this error
TypeError: Unicode-objects must be encoded before hashing [while running 'split train and test/ParDo(_TrainTestSplitFn)']
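For reference, this error is typically Python 3's hashlib rejecting str input. A sketch of the likely fix (the helper below is illustrative, not the repo's actual _TrainTestSplitFn) is to encode the key to bytes before hashing:

```python
import hashlib

def split_bucket(key, train_percent=90):
    # hashlib in Python 3 requires bytes; passing a str raises
    # "Unicode-objects must be encoded before hashing".
    if isinstance(key, str):
        key = key.encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "train" if bucket < train_percent else "test"
```

Because the bucket is derived from a hash of the key, the split stays deterministic: the same thread id always lands in the same set.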
I want to download the Reddit dataset on my local machine in JSON format. How can I do it?
I (as well as numerous other users) would prefer our usernames and messages are not included in large datasets. Where can we submit a Request To Be Forgotten?
Please include support for python3, thanks :)
As the title says, using only comments leads to some information loss because the post is not used.
Any reasons?
Hi Matthew, I met this problem when I was running:
python reddit/create_data.py --output_dir ${DATADIR?} --reddit_table ${PROJECT?}:${DATASET?}.${TABLE?} --runner DataflowRunner --temp_location ${DATADIR?}/temp --staging_location ${DATADIR?}/staging --project ${PROJECT?} --dataset_format JSON --noauth_local_webserver
I1026 01:39:37.442483 139859749066496 dataflow_runner.py:177] 2020-10-26T01:39:37.193Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 766, in run
self._load_main_session(self.local_staging_directory)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 482, in _load_main_session
pickler.load_session(session_file)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 254, in load_session
return dill.load_session(file_path)
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 363, in load_session
module = unpickler.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
klass = self.find_class(module, name)
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
return StockUnpickler.find_class(self, module, name)
File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
__import__(module)
ImportError: No module named deprecation_wrapper
On using pip install apache-beam==2.5.0
I'm getting:
ERROR: tensorflow 1.12.0 has requirement tensorboard<1.13.0,>=1.12.0, but you'll have tensorboard 1.13.1 which is incompatible.
I would assume the error comes from
conversational-datasets/requirements.txt, line 73 (commit 6282739).
Will push a PR if we can confirm the issue.
Was wondering if it would be possible to release weights of the Bert models trained on reddit?
NameError: name 'Comment' is not defined
Was wondering if you have any ideas on parsing these TFRecord files in PyTorch?
In case this helps anyone: my dataflow job on Google Cloud was restricted to 1 worker, so I added the following argument when launching create_data.py:
--num_workers 8
Now the job was restricted to 4 workers instead, due to the quota on external IP addresses. So I turned them off with:
--no_use_public_ips
and enabled Private Google Access for the used region's subnet (us-central1 for me) in the VPC networks section.
Hi,
I have downloaded the Amazon data (38 files) and ran create_data.py with:
python amazon_qa/create_data.py --file_pattern AmazonQA/* --output_dir AmazonQA/processed/ --runner DirectRunner --temp_location AmazonQA/processed/temp --staging_location AmazonQA/processed/staging --dataset_format JSON
It results in 100 train*.json and 100 test*.json under the AmazonQA/processed/ folder. After I read all the data, the training set has 158974 samples and the test set has 16763.
What is the number of samples you used in the paper? 3M or 158.9K? I am confused because it is different from the number listed in the repo.
P.S. I saw some filtering functions have been done in the create_data.py file.
Below are some statistics of the conversational dataset:
Input files: 38
Number of QA dictionaries: 1,569,513
Number of tuples: 4,035,625
Number of de-duplicated tuples: 3,689,912
Train set size: 3,316,905
Test set size: 373,007
Thank you in advance for your kind reply.
I finished downloading the Reddit data, and I can view it quite nicely using the provided tfrutil.py module. The problem I'm encountering is that a lot of the comments seem to be cut off abruptly. For example:
Example 3
[Context]:
Quite a bit according to reddits own gold statistics.
[Response]:
I fucking clicked that without even checking the URL first. Goddamn it. You win the day fine sir.
Extra Contexts:
[context/0]:
I would like to know how much was spent on Reddit gold for people posting a Rick Roll. I bet it's way more than $12 (though to
Example 6
[Context]:
Rather you're rewarded for planning ahead because you don't have a crutch "save me" button.
[Response]:
Not like DBM will tell you when things are coming..
Extra Contexts:
[context/7]:
Warriors on suicide watch \n \n (I don't mind the gcd change on hunters actually. And the two ranged specs seem decent. Survival
[context/6]:
The Disengage on GCD is worse than any of the warriors spells.\n \n People made up stupid builds to show how you could go at
[context/5]:
I dunno, I got used to the disengage change pretty quick. It was annoying at first, mostly because you feel forced to stop
[context/4]:
Defensive abilities should be reactive and putting them locked because you're doing something else hurts the gameplay.\n \n
[context/3]:
To be fair Disengage isn't a purely defensive spell. I use disengage offensively all the time as a survival hunter, even as MM
[context/2]:
But you also use it to dodge problematic stuff while you're DPSing.
[context/1]:
Which is why it has reduced gcd
[context/0]:
But you're still penalized for playing to the best of your class.
As you can see, the comments seem to just be cut off, like "(though to" and ". Survival", etc. Since I'm using this data for what it's intended for, conversations, this could be pretty detrimental to the performance of the model, because the continuation doesn't make sense. Any thoughts on this?
Thanks
edit: These examples are pulled from test-00001-of-00100.tfrecords, if it helps. I don't know if your train-test splitting is random for everyone.
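In the meantime, a crude workaround (a heuristic sketch, not something the repo provides) is to filter out turns that look cut off, e.g. ones that don't end in sentence-final punctuation:

```python
def looks_truncated(text):
    # Heuristic: a turn that doesn't end with sentence-final
    # punctuation was probably cut off by a per-turn character cap.
    # This will produce some false positives (e.g. turns ending in
    # emoji or bare URLs), so treat it as a rough filter only.
    return not text.rstrip().endswith((".", "!", "?", '"', "'", ")"))
```

Applied to the examples above, "(though to" would be flagged while "You win the day fine sir." would pass.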