
conversational-datasets's People

Contributors

coopie, geospith, matthen, pawel-polyai

conversational-datasets's Issues

Would you release the data on Google Drive?

Google Cloud is not available to developers in China, so I can't run the script with the default storage setup.

Could you provide another way to download the data?

Thank you very much!

GCP Authentication Failure

Hi, I was wondering what the fix for this issue is. For the Reddit dataset, I have followed all the steps up to executing:
python tools/tfrutil.py pp ${DATADIR?}/train-00999-of-01000.tfrecords

But when I do, I get this error:

2019-05-24 10:20:36.304120: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.753181 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
[... the same message repeats for attempts 2 through 10 with varying backoff delays ...]
2019-05-24 10:20:44.186215: W tensorflow/core/platform/cloud/google_auth_provider.cc:157] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
Traceback (most recent call last):
  File "tools/tfrutil.py", line 118, in <module>
    _cli()
  File "/Library/Python/2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Library/Python/2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Python/2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Python/2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "tools/tfrutil.py", line 46, in _pretty_print
    for i, record in enumerate(tf.python_io.tf_record_iterator(path)):
  File "/Library/Python/2.7/site-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/Library/Python/2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request: HTTP response code 401 with body '{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "required",
        "message": "Anonymous caller does not have storage.objects.get access to reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords.",
        "locationType": "header",
        "location": "Authorization"
      }
    ],
    "code": 401,
    "message": "Anonymous caller does not have storage.objects.get access to reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords."
  }
}'
when reading metadata of gs://reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords

I suppose this is due to it not being able to access my credentials, so I followed the instructions here:

https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances

and downloaded a <project>-<code>.json file containing:

{
  "type": "service_account",
  "project_id": "xxxx",
  "private_key_id": "xxxxxxxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\nxxxxxxx\n-----END PRIVATE KEY-----\n",
  "client_email": "[email protected]",
  "client_id": "xxxxxxx",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "xxxxxxxxxxxxxx",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/xxxxxxx"
}

The error still persists. I would really appreciate any advice.
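Not a fix from the thread, but the usual remedy for this symptom is to point GOOGLE_APPLICATION_CREDENTIALS at the downloaded key file, so TensorFlow's GCS layer uses it instead of falling back to the GCE metadata server (the host 'metadata' only resolves inside Google Cloud). A minimal sketch, with a hypothetical key path:

import os

# Hypothetical path; point this at the <project>-<code>.json key downloaded above.
# Set the variable before the first GCS read so the credentials are picked up.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key.json"

import tensorflow as tf

for i, record in enumerate(tf.python_io.tf_record_iterator(
        "gs://reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords")):
    print(i)
    break

The shell equivalent is exporting the same variable before running tools/tfrutil.py.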

A tiny error in documentation

Thanks for sharing this code; I'm glad I have unused Google Cloud credits :)
I found one tiny error while following the instructions for creating the OpenSubtitles dataset:

--sentence_files gs://${BUCKET?}/opensubtitles/raw/lines-* \

should be:

--sentence_files gs://${BUCKET?}/opensubtitles/raw/lines/lines-* \

Missing required option: region (in Google Cloud)

Hello!

I get the following error when trying to execute the create_data.py script in the Google Cloud Shell:

Traceback (most recent call last):
  File "reddit/create_data.py", line 347, in <module>
    run()
  File "reddit/create_data.py", line 285, in run
    p = beam.Pipeline(options=pipeline_options)
  File "/home/user/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 203, in __init__
    'Pipeline has validations errors: \n' + '\n'.join(errors))
ValueError: Pipeline has validations errors:
Missing required option: region.

I'm using the latest version of apache-beam, 2.23.0.
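Likely cause (my reading, not confirmed in the thread): newer Beam releases, around the 2.23.0 used here, require an explicit --region for the Dataflow runner. Since create_data.py appears to hand its leftover flags to Beam's PipelineOptions, appending --region to the command should work; the equivalent in code, with us-central1 as an example region, is a sketch like:

from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

options = PipelineOptions()  # create_data.py builds this from the remaining argv
options.view_as(GoogleCloudOptions).region = "us-central1"  # example region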

apache-beam==2.5.0 requirements error

Hey, I am trying to install the requirements for this codebase via:

pip install -r requirements.txt

but am getting this error:

ERROR: No matching distribution found for apache-beam==2.5.0 (from -r conversational-datasets/requirements.txt (line 2))

I tried to fix this by installing apache-beam==2.5.0 via pip, but pip complains that it cannot find that distribution. After investigation, it looks like the latest version on pypi is 2.2.0:

https://pypi.org/project/apache-beam/

Any suggestions on how to proceed? Thank you!
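A plausible explanation (an assumption on my part): apache-beam 2.5.0 predates Beam's Python 3 support, so pip running under Python 3 reports no matching distribution even though the release exists on PyPI. A quick check:

import sys

# apache-beam==2.5.0 only ships Python 2 artifacts; under Python 3, either use
# a Python 2.7 virtualenv or bump the requirements pin to a newer Beam release.
print(sys.version_info)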

How to run?

I don't want to run this code on Google Cloud; I just want to follow it up to "Extract the data and split it into shards", but I don't know how to run the commands below. Can someone explain?

This one:

PROJECT="your-google-cloud-project"

DATADIR="gs://${BUCKET?}/opensubtitles/$(date +"%Y%m%d")"

python opensubtitles/create_data.py \
  --output_dir ${DATADIR?} \
  --sentence_files gs://${BUCKET?}/opensubtitles/raw/lines/lines-* \
  --runner DataflowRunner \
  --temp_location ${DATADIR?}/temp \
  --staging_location ${DATADIR?}/staging \
  --project ${PROJECT?} \
  --dataset_format TF
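For a purely local run, here is a sketch modeled on the DirectRunner invocation shown for amazon_qa further down this page; the local paths are mine, and I'm assuming --sentence_files accepts local glob patterns:

import subprocess

subprocess.check_call([
    "python", "opensubtitles/create_data.py",
    "--output_dir", "opensubtitles/processed",              # local output directory
    "--sentence_files", "opensubtitles/raw/lines/lines-*",  # local raw lines; Beam expands the glob
    "--runner", "DirectRunner",                             # run locally, no Google Cloud
    "--temp_location", "opensubtitles/processed/temp",
    "--staging_location", "opensubtitles/processed/staging",
    "--dataset_format", "TF",
])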

Response selection at test time

What if I want to test the model on a single example (context)? What would the candidate responses be for computing relevance scores against the context? Are there techniques for retrieving a set of candidate responses from the large corpus? It would be helpful if you could share some details. Thanks!
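Not an answer from the maintainers, but the usual retrieval setup for these baselines is to encode a large pool of candidate responses once (for example, all training-set responses) and rank them against each incoming context by dot product. A sketch, where the encoder producing the vectors is assumed:

import numpy as np

def rank_responses(context_vec, response_vecs, top_k=10):
    # response_vecs: (num_candidates, dim) encodings of the candidate pool;
    # context_vec: (dim,) encoding of the test context.
    scores = response_vecs @ context_vec
    return np.argsort(-scores)[:top_k]  # indices of the best candidates first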

The app is blocked

This app is blocked
This app tried to access sensitive info in your Google Account. To keep your account safe, Google blocked this access.

AttributeError: 'NoneType' object has no attribute 'Client'

INFO:apache_beam.runners.dataflow.dataflow_runner:Could not estimate size of source <apache_beam.io.gcp.bigquery._CustomBigQuerySource object at 0x7faa91a5c0d0> due to an exception: Traceback (most recent call last):
  File "/home/alvynabranches/redit_data/venv/lib/python3.9/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1186, in run_Read
    transform.source.estimate_size())
  File "/home/alvynabranches/redit_data/venv/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery.py", line 764, in estimate_size
    bq = bigquery_tools.BigQueryWrapper()
  File "/home/alvynabranches/redit_data/venv/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 336, in __init__
    self.gcp_bq_client = client or gcp_bigquery.Client(
AttributeError: 'NoneType' object has no attribute 'Client'
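My diagnosis, not confirmed in the thread: bigquery_tools imports google.cloud.bigquery inside a try/except and sets it to None when the import fails, so this AttributeError usually means Beam's optional GCP extras are not installed. A quick check:

try:
    from google.cloud import bigquery  # provided by apache-beam's [gcp] extras
except ImportError:
    print("GCP extras missing; install with: pip install 'apache-beam[gcp]'")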

"No module named module_wrapper"

Hi, I got the same error and solved it by updating httplib2 to the latest version. Regarding the requirements, I also updated to tensorflow==1.15.0, since version 1.14.0 gives me the error "No module named deprecation_wrapper".

Hmm, did you change anything in the requirements.txt file other than updating httplib2 to the newest version and tensorflow to 1.15.0? I did both of those things, but now I'm getting a "No module named module_wrapper" error :(

Originally posted by @amorisot in #70 (comment)

Chinese Data

I get this error:

TypeError: Unicode-objects must be encoded before hashing [while running 'split train and test/ParDo(_TrainTestSplitFn)']
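This is the standard Python 3 hashlib complaint: str values must be encoded to bytes before hashing. Assuming the train/test split hashes the example's text key (a guess based on the DoFn name in the error), the fix looks like:

import hashlib

def bucket_for(text, num_buckets=100):
    # encode() is the fix: md5 needs bytes, not a unicode str, on Python 3.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets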

Local Download

I want to download the Reddit dataset to my local machine in JSON format. How can I do that?
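A sketch, modeled on the amazon_qa DirectRunner command further down this page; note that the Reddit pipeline reads its source from BigQuery, so a Google Cloud project is still needed even though the JSON shards are written locally. The placeholders are hypothetical:

import subprocess

subprocess.check_call([
    "python", "reddit/create_data.py",
    "--output_dir", "reddit/processed",          # local directory for the JSON shards
    "--reddit_table", "PROJECT:DATASET.TABLE",   # BigQuery source; still requires GCP access
    "--runner", "DirectRunner",                  # run on the local machine
    "--dataset_format", "JSON",
])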

How to submit RTBF

I (along with numerous other users) would prefer that our usernames and messages not be included in large datasets. Where can we submit a Request To Be Forgotten?

No module named deprecation_wrapper

Hi Matthew, I ran into this problem when running:

python reddit/create_data.py --output_dir ${DATADIR?} --reddit_table ${PROJECT?}:${DATASET?}.${TABLE?} --runner DataflowRunner --temp_location ${DATADIR?}/temp --staging_location ${DATADIR?}/staging --project ${PROJECT?} --dataset_format JSON --noauth_local_webserver

I1026 01:39:37.442483 139859749066496 dataflow_runner.py:177] 2020-10-26T01:39:37.193Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 766, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 482, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 254, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 363, in load_session
    module = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named deprecation_wrapper
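Per the related report above ("No module named module_wrapper"), deprecation_wrapper appears in tensorflow 1.15, so workers running an older 1.x wheel fail on this import; the reported fix was pinning tensorflow==1.15.0 in requirements.txt. A quick local check:

import tensorflow as tf

print(tf.__version__)  # the fix reported above was pinning this to 1.15.0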

Line 114

NameError: name 'Comment' is not defined

Using pytorch

I was wondering if you have any ideas on parsing these TFRecord files in PyTorch?
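One workable approach (a sketch, not something the repo ships): keep TensorFlow around purely as a reader and wrap each shard in a PyTorch IterableDataset. The context/response feature names follow the examples shown below; TF 2.x eager execution is assumed for the .numpy() call:

import tensorflow as tf
from torch.utils.data import IterableDataset

class TFRecordPairs(IterableDataset):
    # Streams (context, response) string pairs out of one .tfrecords shard.
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        for raw in tf.data.TFRecordDataset(self.path):
            example = tf.train.Example.FromString(raw.numpy())
            feats = example.features.feature
            yield (feats["context"].bytes_list.value[0].decode("utf-8"),
                   feats["response"].bytes_list.value[0].decode("utf-8"))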

Get more workers with Google Cloud's free trial

In case this helps anyone: my Dataflow job on Google Cloud was restricted to 1 worker, so I added the following argument when launching create_data.py:
--num_workers 8
The job was then restricted to 4 workers due to the quota on external IP addresses, so I turned those off with:
--no_use_public_ips
and enabled Private Google Access for the used region's subnet (us-central1 for me) in the VPC networks section.

AmazonQA Data Size

Hi,

I downloaded the Amazon data (38 files) and ran create_data.py with:

python amazon_qa/create_data.py --file_pattern AmazonQA/* --output_dir AmazonQA/processed/ --runner DirectRunner --temp_location AmazonQA/processed/temp --staging_location AmazonQA/processed/staging --dataset_format JSON

This produces 100 train*.json and 100 test*.json files under AmazonQA/processed/. After reading all the data, the training set has 158,974 examples and the test set has 16,763.

How many examples were used in the paper, 3M or 158.9K? I am confused because this differs from the numbers listed in the repo.

P.S. I see that some filtering is done in create_data.py.

Below are some statistics of the conversational dataset:

Input files: 38
Number of QA dictionaries: 1,569,513
Number of tuples: 4,035,625
Number of de-duplicated tuples: 3,689,912
Train set size: 3,316,905
Test set size: 373,007

Thank you in advance for your kind reply.
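To double-check those counts, the shards can be tallied directly; a sketch assuming the JSON format writes one serialized example per line:

import glob

train = sum(1 for path in glob.glob("AmazonQA/processed/train-*.json")
            for _ in open(path))
test = sum(1 for path in glob.glob("AmazonQA/processed/test-*.json")
           for _ in open(path))
print(train, test)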

Reddit dataset seems to not be showing the full comment

I finished downloading the Reddit data, and I can view it quite nicely using the provided tfrutil.py module. The problem I'm encountering is that a lot of the comments seem to be cut off abruptly. For example:

Example 3

[Context]:
Quite a bit according to reddits own gold statistics.
[Response]:
I fucking clicked that without even checking the URL first. Goddamn it. You win the day fine sir.

Extra Contexts:
[context/0]:
I would like to know how much was spent on Reddit gold for people posting a Rick Roll. I bet it's way more than $12 (though to

Example 6

[Context]:
Rather you're rewarded for planning ahead because you don't have a crutch "save me" button.
[Response]:
Not like DBM will tell you when things are coming..

Extra Contexts:
[context/7]:
Warriors on suicide watch \n \n (I don't mind the gcd change on hunters actually. And the two ranged specs seem decent. Survival
[context/6]:
The Disengage on GCD is worse than any of the warriors spells.\n \n People made up stupid builds to show how you could go at
[context/5]:
I dunno, I got used to the disengage change pretty quick. It was annoying at first, mostly because you feel forced to stop
[context/4]:
Defensive abilities should be reactive and putting them locked because you're doing something else hurts the gameplay.\n \n
[context/3]:
To be fair Disengage isn't a purely defensive spell. I use disengage offensively all the time as a survival hunter, even as MM
[context/2]:
But you also use it to dodge problematic stuff while you're DPSing.
[context/1]:
Which is why it has reduced gcd
[context/0]:
But you're still penalized for playing to the best of your class.

As you can see, the comments are simply cut off, like "(though to" and ". Survival". Since I'm using this data for its intended purpose, conversations, this could be quite detrimental to model performance, because the continuations don't make sense. Any thoughts on this?

Thanks

edit: These examples are pulled from test-00001-of-00100.tfrecords, if that helps. I don't know whether the train-test splitting is the same for everyone.
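One way to test whether this is a deliberate length cap rather than corruption (my hypothesis, not confirmed here) is to scan a shard for the longest stored string; a uniform maximum across examples would point to a pipeline-side truncation limit. A sketch, assuming TF 2.x eager execution:

import tensorflow as tf

max_len = 0
for raw in tf.data.TFRecordDataset("test-00001-of-00100.tfrecords"):
    example = tf.train.Example.FromString(raw.numpy())
    for feature in example.features.feature.values():
        for value in feature.bytes_list.value:
            max_len = max(max_len, len(value))
print("longest stored string:", max_len)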
