aphp / spark-etl


A better bridge between Apache Spark and PostgreSQL

License: Apache License 2.0

Scala 84.33% Dockerfile 0.05% HTML 0.30% Shell 0.13% JavaScript 11.81% CSS 0.02% TypeScript 1.51% Java 1.86%

etl postgresql spark


spark-etl's Issues

PySpark compatibility

Being able to run from PySpark.

Store input CSV to HDFS

Right now the CSV is only written locally.
It should go to HDFS by default, and be written locally only if the path is prefixed with "file://".
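
The prefix rule above can be sketched as a small path resolver. This is only an illustration in plain Python, not the project's code: the function name `resolve_csv_target` and the "local"/"hdfs" labels are hypothetical.

```python
def resolve_csv_target(path):
    """Return (filesystem, path) for a CSV output location.

    Assumption: only the "file://" prefix selects the local filesystem;
    every other path is treated as an HDFS path.
    """
    local_prefix = "file://"
    if path.startswith(local_prefix):
        # Strip the prefix so the caller gets a plain local path.
        return ("local", path[len(local_prefix):])
    return ("hdfs", path)
```

For example, `resolve_csv_target("file:///tmp/out.csv")` selects the local filesystem, while `resolve_csv_target("/data/out.csv")` falls through to HDFS.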

There might be good ideas in the Sqoop direct import, in particular the PostgresqlAsyncSink class.

Optionally add a hash column

This adds a new column that contains a hash of the other columns.

A function would take:

dataframe
hashColumnName
columns not to hash[]
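
A minimal sketch of such a function, assuming rows are plain Python dicts rather than a Spark dataframe. The name `add_hash_column`, the choice of SHA-256, and the field separator are all assumptions made for illustration.

```python
import hashlib


def add_hash_column(rows, hash_column_name, columns_not_to_hash=()):
    """Return copies of the rows with an extra column holding a hash.

    Assumptions: each row is a dict, the hash is SHA-256 over the
    string form of the remaining columns, taken in sorted name order.
    """
    excluded = set(columns_not_to_hash) | {hash_column_name}
    result = []
    for row in rows:
        names = sorted(name for name in row if name not in excluded)
        # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
        # Note that None and "" hash identically in this sketch.
        payload = "\x1f".join(
            "" if row[name] is None else str(row[name]) for name in names
        )
        new_row = dict(row)
        new_row[hash_column_name] = hashlib.sha256(
            payload.encode("utf-8")
        ).hexdigest()
        result.append(new_row)
    return result
```

Columns listed in `columns_not_to_hash` (a load timestamp, for instance) do not influence the hash, so two rows that differ only in those columns get the same value.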

Handle array fields in CSV

It might be interesting to:

load a dataframe from a CSV with arrays
input/output from a table with array columns

Some ideas there:
read
write
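
As one illustration of the reading side, here is a hedged sketch that parses a one-dimensional PostgreSQL array literal (the `{a,b,c}` text form) as it would appear in a CSV field. The function name `parse_pg_array` is hypothetical, elements are returned as strings, and nested arrays are not handled.

```python
def _finish(buf, was_quoted):
    """Turn the collected characters of one element into a value."""
    value = "".join(buf)
    # An unquoted NULL is the SQL null; a quoted "NULL" is a string.
    if not was_quoted and value.upper() == "NULL":
        return None
    return value


def parse_pg_array(text):
    """Parse a one-dimensional array literal such as '{a,"b,c",NULL}'."""
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("not an array literal: %r" % (text,))
    body = text[1:-1]
    if body == "":
        return []
    items, buf = [], []
    quoted = False      # currently inside double quotes
    was_quoted = False  # the current element used double quotes
    i = 0
    while i < len(body):
        ch = body[i]
        if quoted:
            if ch == "\\" and i + 1 < len(body):
                # A backslash escapes the next character inside quotes.
                buf.append(body[i + 1])
                i += 2
                continue
            if ch == '"':
                quoted = False
            else:
                buf.append(ch)
        elif ch == '"':
            quoted = True
            was_quoted = True
        elif ch == ",":
            items.append(_finish(buf, was_quoted))
            buf, was_quoted = [], False
        else:
            buf.append(ch)
        i += 1
    items.append(_finish(buf, was_quoted))
    return items
```

Converting the string elements to the column's element type (integers, dates, and so on) is left out of this sketch.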

Improve UUID use

temp folder
temp csv
spark temp tables
postgres temp tables
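
A minimal sketch of how one naming helper could serve all four kinds of temporary objects. The name `temp_name` and the prefixes are hypothetical; the hex form of the UUID is used so the result is also a valid unquoted SQL identifier.

```python
import uuid


def temp_name(prefix):
    """Build a unique temporary name such as 'tmp_table_<32 hex chars>'.

    Assumption: the prefix is short, so the result stays within the
    63-byte identifier limit of PostgreSQL.
    """
    return "%s_%s" % (prefix, uuid.uuid4().hex)


# Hypothetical usage for each kind of temporary object.
temp_folder = "/tmp/" + temp_name("spark_etl")
temp_csv = temp_name("export") + ".csv"
spark_temp_table = temp_name("spark_tmp")
postgres_temp_table = temp_name("pg_tmp")
```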

Make imports parallel

Splitting a table into multiple queries run from multiple executors
will speed up the data retrieval. Steps are:

get min/max
set up multiple executors
run multiple COPY FROM
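
The min/max step can be sketched as follows: given the bounds of a numeric key, compute one inclusive range per executor and derive one query per range. This is an illustration in plain Python; `split_range`, `range_queries`, and the numeric key column are assumptions, and the COPY call that would wrap each query is left out.

```python
def split_range(lo, hi, parts):
    """Split the inclusive range [lo, hi] into at most `parts` chunks."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    if hi < lo:
        return []
    total = hi - lo + 1
    parts = min(parts, total)  # never create empty chunks
    base, extra = divmod(total, parts)
    bounds = []
    start = lo
    for i in range(parts):
        # The first `extra` chunks get one more value each.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds


def range_queries(table, key_column, bounds):
    """Build one SELECT per chunk; each executor would run one of them."""
    return [
        "SELECT * FROM %s WHERE %s BETWEEN %d AND %d"
        % (table, key_column, lo, hi)
        for lo, hi in bounds
    ]
```

Equal-width ranges only balance the load when the key is roughly uniformly distributed; a skewed key would need a different split.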

spark-postgres class not found error

When I try to use this tool following the instructions in the README, I get
java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class. Is there any way to fix this?

I keep getting this error: org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (compile)

Handle boolean fields

Postgres returns f/t for boolean values.
Transform them to false/true.
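
A minimal sketch of that transformation for a single CSV field. The function name `pg_boolean_to_text` is hypothetical, and treating an empty field as a null that passes through unchanged is an assumption.

```python
_PG_BOOLEAN_TEXT = {"t": "true", "f": "false"}


def pg_boolean_to_text(value):
    """Map the Postgres text form of a boolean (t/f) to true/false.

    Assumption: None and the empty string stand for SQL NULL and are
    returned unchanged.
    """
    if value is None or value == "":
        return value
    try:
        return _PG_BOOLEAN_TEXT[value]
    except KeyError:
        raise ValueError("not a Postgres boolean: %r" % (value,))
```

The function is only safe to apply to columns known to be boolean, since a text column may legitimately contain the single letters t or f.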

Allow * in pgpass for databases
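
The pgpass format is hostname:port:database:username:password, and a literal * in any of the first four fields matches anything. A hedged sketch of a lookup that honours the wildcard in every field, including the database, is shown below. The function name `find_password` is hypothetical, and backslash escaping of ":" inside fields is deliberately not handled.

```python
def find_password(pgpass_lines, host, port, database, user):
    """Return the password of the first matching pgpass entry, or None.

    Simplification: escaped colons (\\:) inside fields are not supported.
    """
    wanted = (host, str(port), database, user)
    for raw in pgpass_lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":", 4)
        if len(fields) != 5:
            continue  # malformed entry
        patterns, password = fields[:4], fields[4]
        # "*" matches any value; otherwise the field must be equal.
        if all(p == "*" or p == w for p, w in zip(patterns, wanted)):
            return password
    return None
```

Entries are checked in file order and the first match wins, so more specific entries should be listed before wildcard ones.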

Add "reindex" argument

Allow de-indexing before a large bulk load and reindexing the table afterward.

Deindex

UPDATE pg_index
SET indisready=false
WHERE indrelid IN (
    SELECT pg_class.oid FROM pg_class
    JOIN pg_catalog.pg_namespace n ON n.oid = pg_class.relnamespace
    WHERE relname='theTable' AND nspname = 'theSchema'
);

Insert
Reindex

UPDATE pg_index
SET indisready=true
WHERE indrelid IN (
    SELECT pg_class.oid FROM pg_class
    JOIN pg_catalog.pg_namespace n ON n.oid = pg_class.relnamespace
    WHERE relname='theTable' AND nspname = 'theSchema'
);

REINDEX TABLE theSchema.theTable;
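
The deindex / insert / reindex sequence above could be driven by a "reindex" flag roughly as follows. This is only a sketch: the names `bulk_load_plan` and `quote_ident` are hypothetical, the `%s` placeholders assume a DB-API driver using the "format" parameter style, and the insert step is a placeholder for the actual bulk load.

```python
_SET_INDISREADY = (
    "UPDATE pg_index SET indisready = %s "
    "WHERE indrelid IN ("
    "SELECT pg_class.oid FROM pg_class "
    "JOIN pg_catalog.pg_namespace n ON n.oid = pg_class.relnamespace "
    "WHERE relname = %s AND nspname = %s)"
)


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def bulk_load_plan(schema, table, reindex):
    """Return the ordered steps as (step_name, sql, params) tuples.

    The "insert" step carries no SQL here; the caller runs the bulk
    load at that point.
    """
    insert_step = ("insert", None, ())
    if not reindex:
        return [insert_step]
    # Identifiers are quoted so that REINDEX targets exactly the names
    # matched by relname/nspname in the catalog query.
    target = "%s.%s" % (quote_ident(schema), quote_ident(table))
    return [
        ("deindex", _SET_INDISREADY, (False, table, schema)),
        insert_step,
        ("mark_ready", _SET_INDISREADY, (True, table, schema)),
        ("reindex", "REINDEX TABLE " + target, ()),
    ]
```

Returning the plan as data keeps the sketch independent of any particular database driver; a caller would execute each step's SQL with its parameters in order.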

Provide both parallel level and partitions for inputBulk

The parallel level would be the number of concurrent Postgres connections,
while
partitions would be the resulting number of CSV files in HDFS.

This will be useful for large multiline CSVs, because they will be split into multiple
small CSVs instead of a few large ones.

Add Spark-backed SCD 1

The goal is to let Spark detect the changes, insert new rows, and update rows that differ. Steps are:

fetch both ID and HASH from the entire target table into Spark
compare ID and HASH between the target table and the source table to build:
1. the new rows
2. the rows to update
insert the new rows directly
insert the rows to update into a temp table, then update the target from it

This limits the computation done by PostgreSQL.
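
The comparison step can be sketched with plain dictionaries mapping ID to HASH. In Spark this would be a join between two dataframes; the function name `split_scd1_changes` and the dict representation are assumptions made to keep the example self-contained.

```python
def split_scd1_changes(source_hashes, target_hashes):
    """Compare {id: hash} maps and return (new_ids, changed_ids).

    new_ids:     present in the source but not in the target
    changed_ids: present in both, with a different hash
    """
    new_ids = sorted(i for i in source_hashes if i not in target_hashes)
    changed_ids = sorted(
        i
        for i in source_hashes
        if i in target_hashes and source_hashes[i] != target_hashes[i]
    )
    return new_ids, changed_ids
```

Rows whose ID and HASH both match are left untouched, and rows that exist only in the target are ignored, which matches SCD type 1 semantics where no history or deletion is tracked.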

SparkR compatibility

Being able to run from SparkR.