
clinical-notes-diagnosis-dl-nlp's People

Contributors

cesaros, lsy3, sarahesl, victkid


clinical-notes-diagnosis-dl-nlp's Issues

SyntaxError: invalid syntax

In describe_icd9category.ipynb, at line 50 of the second cell, the call .map(lambda (hid, d): d[0]) throws SyntaxError: invalid syntax.

@lsy3 Can you help me with this error?
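For reference, this looks like the Python 2 tuple-parameter syntax, which was removed in Python 3 (PEP 3113), so the notebook would only run under Python 2. Below is a minimal sketch of the Python 3 rewrite, assuming the mapped elements are (hid, d) pairs as in the original lambda; the toy data and variable names are stand-ins, not code from the notebook:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("lambda-fix").getOrCreate()

# Toy (hid, d) pairs standing in for the notebook's RDD.
rdd = spark.sparkContext.parallelize([(100, ["4019", "4280"]), (101, ["25000"])])

# Python 2 only (SyntaxError on Python 3):
#   rdd.map(lambda (hid, d): d[0])
# Python 3: take a single argument and unpack it inside the body instead.
first_codes = rdd.map(lambda pair: pair[1][0]).collect()  # pair is (hid, d); keep d[0]
print(first_codes)  # ['4019', '25000']
```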

Request for the cleaned version of the dataset

Hello,

I am trying to replicate the project. I am following the procedure in the README, which says there is a Google Drive link containing the cleaned version of the dataset, but I searched the Environment Setup (Local) section and couldn't find the link. It would be great if you could share the cleaned version of the dataset.

Regards,
Shashank.

Clinical-notes-diagnosis execution issue

Hello,
I have read your research work, and it is really good. I am trying to execute your code, but when I run preprocess.py I get the error below. Please guide me on this issue.

Thanks,

Py4JJavaError Traceback (most recent call last)
in <module>()
2 t0 = time.time()
3
----> 4 df_id2texticd9, topicd9 = get_id_to_texticd9("hadm_id", 50)
5 df_id2texticd9.write.csv("/home/k163013/data/DATA_HADM", header=True)
6

in get_id_to_texticd9(id_type, topX, stopwords)
39 [id_] + sparse2vec(mapper, icd9) + [(text if len(stopwords)
40 == 0 else remstopwords(text))])
---> 41 return (spark.createDataFrame(ne_topX, ['id'] + topicd9 + ['text']), topicd9)

/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
685
686 if isinstance(data, RDD):
--> 687 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
688 else:
689 rdd, schema = self._createFromLocal(map(prepare, data), schema)

/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/session.pyc in _createFromRDD(self, rdd, schema, samplingRatio)
382 """
383 if schema is None or isinstance(schema, (list, tuple)):
--> 384 struct = self._inferSchema(rdd, samplingRatio, names=schema)
385 converter = _create_converter(struct)
386 rdd = rdd.map(converter)

/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/session.pyc in _inferSchema(self, rdd, samplingRatio, names)
353 :return: :class:pyspark.sql.types.StructType
354 """
--> 355 first = rdd.first()
356 if not first:
357 raise ValueError("The first row in RDD is empty, "

/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/pyspark/rdd.pyc in first(self)
1374 ValueError: RDD is empty
1375 """
-> 1376 rs = self.take(1)
1377 if rs:
1378 return rs[0]

/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/pyspark/rdd.pyc in take(self, num)
1356
1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
1359
1360 items += res

/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/pyspark/context.pyc in runJob(self, rdd, partitionFunc, partitions, allowLocal)
999 # SparkContext#runJob.
1000 mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
1003

/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py in __call__(self, *args)
1158 answer = self.gateway_client.send_command(command)
1159 return_value = get_return_value(
-> 1160 answer, self.gateway_client, self.target_id, self.name)
1161
1162 for temp_arg in temp_args:

/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 raise Py4JJavaError(
319 "An error occurred while calling {0}{1}{2}.\n".
--> 320 format(target_id, ".", name), value)
321 else:
322 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 46.0 failed 1 times, most recent failure: Lost task 7.0 in stage 46.0 (TID 3638, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
process()
File "/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "", line 34, in
TypeError: 'set' object is not callable

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2048)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2067)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:141)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
process()
File "/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/k163013/spark/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "", line 34, in
TypeError: 'set' object is not callable

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
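
For what it's worth, TypeError: 'set' object is not callable generally means that a name later used as a function has been rebound to (or shadowed by) a set somewhere in the worker code. A minimal illustration of the pattern; the names below are hypothetical and not taken from preprocess.py:

```python
stopwords = {"the", "a", "an"}            # a plain set of words
words = ["the", "patient", "was", "admitted"]

# Wrong: treating the set as a function reproduces this exact error.
# cleaned = stopwords(words)              # TypeError: 'set' object is not callable

# Right: test membership instead of calling the set.
cleaned = [w for w in words if w not in stopwords]
print(cleaned)                            # ['patient', 'was', 'admitted']
```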

Error in preprocessing stage

Hello,

Thanks for the prompt reply last time.

I have been running your code by following the README, but I am getting a few errors in preprocess.py:

Line 54: It says the function isn't being used properly. Can you tell me what that line does to the RDD?
Line 152: It says the filter is not defined properly. Can you explain lines 151 and 152?
Line 198: It raises TypeError: unhashable type: 'set'.

I think there is some relation between the set() used at line 151 and the error raised at line 198; maybe the type produced by the sparse2vec function doesn't match, or can't be cast to, what the main function expects.
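
For reference, TypeError: unhashable type: 'set' is raised whenever a plain set is used where a hashable value is required (a dictionary key, a member of another set, and so on); frozenset is the hashable alternative. A minimal illustration with hypothetical names, not anything from preprocess.py:

```python
icd9_codes = {"4019", "4280"}

# counts = {icd9_codes: 1}                 # TypeError: unhashable type: 'set'
counts = {frozenset(icd9_codes): 1}        # frozenset is hashable, so this works
print(counts)
```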

Thanks in advance.

Regards,
Shashank Reddy Boosi.

Word2Vec - Feature Extraction Issue

Hello there,

I have three queries regarding the inputs to the feature extraction step:

1) Can you tell me where you got the .bin file stated under

bio_nlp_vec/PubMed-shuffle-win-*.txt Download here (you will need to convert the .bin files to .txt. I used gensim to do this)

The download contains many .sh files, and I didn't understand how you got the .bin file, because in the Bio_NLP repository they show the procedure as:

pre-process.sh: segment and tokenized input text (e.g. raw PubMed or PMC text)
create_shf_low_text.sh: create lowercased and sentence-shuffled text (input: tokenized text)
createModel.sh: Create word2vec.bin file with different parameters
intrinsicEva.sh: run intrinsic evaluation on UMNSRS and Mayo data-set (input: Dir. for testing vector)
ExtrinsicEva.sh: run extrinsic evaluation

In createModel.sh they say it produces the .bin file. Is that the .bin file you are talking about, or do I need to run anything else to get PubMed-shuffle-win-*.txt? (See the gensim conversion sketch at the end of this issue.)

2) I searched for the

model_word2vec_v2_*dim.txt (generated word2vec)

and

model_doc2vec_v2_*dim_final.csv (generated word2vec)

that are used in the feature extraction IPython notebooks. I checked the generators and found the doc2vec output, but I am not able to find the word2vec text file.

3) I also have a doubt: from which file did you extract the

TRAIN-VAL-TEST-HADMID.p

pickle file that is used as an input in the feature extraction non-seq notebook, and what does that pickle file contain?

Thanks in advance.

Regards,
Shashank Reddy Boosi.
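
For completeness, here is the rough gensim snippet I would expect for the .bin to .txt conversion mentioned in the README (the input filename is an assumption on my part, and I have not verified it against your setup):

```python
from gensim.models import KeyedVectors

# Load the binary word2vec model (filename assumed) and re-save it in the
# plain-text word2vec format that the feature-extraction notebooks read.
kv = KeyedVectors.load_word2vec_format("PubMed-shuffle-win-2.bin", binary=True)
kv.save_word2vec_format("PubMed-shuffle-win-2.txt", binary=False)
```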
