Fix issues with `compose-controller-spark-sql-single.yaml` about fhir-data-pipes HOT 6 CLOSED

bashir2 commented on July 24, 2024

Fix issues with `compose-controller-spark-sql-single.yaml`

from fhir-data-pipes.

Comments (6)

bashir2 commented on July 24, 2024

Adding this to the Beta milestone, as we should figure out the configuration knobs for the single-process Spark version if we want to show it to users as an example.

bashir2 commented on July 24, 2024

For the 800K patients dataset, the transformed Parquet files are available at `gs://fhir-analytics-test/OUT_from-json_791562`. They are generated by directly reading the JSON files into our pipeline and converting the resources to Parquet.

atulai-sg commented on July 24, 2024

Hi @bashir2, we understand that the machine we are using is very high-spec and that a typical single machine will be much lower-spec. We have also seen customers run a Spark cluster separately, which would be the ideal setup: run the Spark cluster on its own and give its master URL in the trimmed compose file. Testing 800K patients and 160M Observations on a single machine sounds unrealistic to me, which is why I did not prioritise this issue.

atulai-sg commented on July 24, 2024

Let's say that, for some reason, we want to showcase 800K patients on a single machine; then we can use the other compose file, `compose-controller-spark-sql.yaml`, just for that purpose. I believe this should not be considered a beta blocker.

atulai-sg commented on July 24, 2024

Did we try setting the executor memory like this: `./sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m`?

bashir2 commented on July 24, 2024

Thanks @atulai-sg for the notes.

Re. data size: in general, we should leave it to the user to decide whether to use a single node or a cluster, but our suggested solution should not crash because of data size. In other words, it may take a very long time to process 800K patients and 160M observations on a single machine, but it should not crash. This is the part that makes it a beta blocker. I believe Spark can use the disk when it runs out of memory, so even with a small amount of memory, large jobs like the above should be doable, though possibly very slow.

Re. `start-thriftserver.sh` options: no, I have not debugged it much, and my guess is that there are some simple knobs to fix the memory issue. It just needs someone to look into it.
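As a starting point for whoever looks into those knobs, here is a minimal sketch of what a single-process invocation might look like. It rests on assumptions not confirmed in this thread: that the single-node setup runs Spark with a `local[N]` master, in which case executors share the driver JVM and `--driver-memory` (not `--executor-memory`) is likely the setting that matters; and that raising `spark.sql.shuffle.partitions` helps by making each task smaller. The function name `build_thrift_cmd` and the values `4`, `4g`, and `400` are hypothetical, not tested settings. The script only prints the command so it can be reviewed before anything is started.

```shell
#!/bin/sh
# Sketch only: prints a candidate command instead of running it.
# The function name and all values passed to it are hypothetical.
build_thrift_cmd() {
  cores="$1"       # worker threads for the local[N] master
  driver_mem="$2"  # heap for the single JVM, e.g. 4g
  partitions="$3"  # more shuffle partitions means smaller tasks
  printf '%s %s %s %s\n' \
    "./sbin/start-thriftserver.sh" \
    "--master local[$cores]" \
    "--driver-memory $driver_mem" \
    "--conf spark.sql.shuffle.partitions=$partitions"
}

# Print the command for review; run it manually once the values look right.
build_thrift_cmd 4 4g 400
```

Whether these particular options stop the crash on the 800K patients dataset would still need to be verified against the actual compose file.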
