aphp / spark-etl
Better bridge Apache Spark and PostgreSQL
License: Apache License 2.0
Being able to run from PySpark.
Right now the CSV is only written locally.
It should go into HDFS by default, and be written locally only if the path is prefixed with "file://".
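A minimal sketch of the path rule described above, assuming the decision is made purely on the "file://" prefix. The function name `resolve_csv_target` and the "local"/"hdfs" labels are hypothetical and are not part of spark-etl.

```python
from typing import Tuple


def resolve_csv_target(path: str) -> Tuple[str, str]:
    """Return (filesystem, path) for a CSV output location.

    Assumption: any path without the "file://" prefix is treated as HDFS.
    """
    local_prefix = "file://"
    if path.startswith(local_prefix):
        # Strip the prefix so the caller gets a plain local path.
        return ("local", path[len(local_prefix):])
    return ("hdfs", path)
```

With this rule, "file:///tmp/out.csv" would be written to the local path /tmp/out.csv, and "/data/out.csv" would be written to HDFS.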
There might be good ideas in the Sqoop direct import, in particular the PostgresqlAsyncSink class.
This will add a new column that calculates the hash of the columns.
A function would have:
Splitting a table into multiple queries run from multiple executors
will speed up data retrieval. Steps are:
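As a rough illustration of the splitting idea only, the sketch below cuts a numeric key range into contiguous, non-overlapping ranges and builds one query per range, so each executor could fetch its own slice. It assumes an integer key column with known bounds. The function name `split_queries` and its parameters are hypothetical, and identifiers are interpolated without quoting, so it is only suitable for trusted input.

```python
from typing import List


def split_queries(table: str, column: str, lower: int, upper: int,
                  num_splits: int) -> List[str]:
    """Build at most num_splits range queries covering [lower, upper]."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if upper < lower:
        raise ValueError("upper must not be smaller than lower")
    total = upper - lower + 1
    step = -(-total // num_splits)  # ceiling division
    queries = []
    start = lower
    while start <= upper:
        end = min(start + step - 1, upper)
        # BETWEEN is inclusive on both ends, so ranges do not overlap.
        queries.append(
            f"SELECT * FROM {table} WHERE {column} BETWEEN {start} AND {end}"
        )
        start = end + 1
    return queries
```

For keys 1 to 10 and three splits, this yields the ranges 1 to 4, 5 to 8 and 9 to 10. Even ranges only balance the load if the key values are roughly evenly distributed.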
When I try to use this tool following the instructions in the README, I get:
java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
Is there any way to fix this?
Postgres returns f/t for booleans.
Transform them to false/true.
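A minimal sketch of that conversion, assuming the values arrive as the text "t" or "f" (or as a missing value for NULL). The function name `pg_bool_to_text` is hypothetical.

```python
from typing import Optional

# Text forms Postgres emits for booleans, mapped to the wanted spelling.
_PG_BOOL = {"t": "true", "f": "false"}


def pg_bool_to_text(value: Optional[str]) -> Optional[str]:
    """Map Postgres 't'/'f' to 'true'/'false', keeping NULL as None."""
    if value is None:
        return None
    if value not in _PG_BOOL:
        raise ValueError(f"not a Postgres boolean literal: {value!r}")
    return _PG_BOOL[value]
```

Raising on unexpected input is a design choice made here so that a non-boolean column passed by mistake is noticed rather than silently passed through.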
Allow deindexing before a large bulk load and reindexing the table afterward:
UPDATE pg_index
SET indisready=false
WHERE indrelid IN (
SELECT pg_class.oid FROM pg_class
JOIN pg_catalog.pg_namespace n ON n.oid = pg_class.relnamespace
WHERE relname='theTable' and nspname = 'theSchema'
);
UPDATE pg_index
SET indisready=true
WHERE indrelid IN (
SELECT pg_class.oid FROM pg_class
JOIN pg_catalog.pg_namespace n ON n.oid = pg_class.relnamespace
WHERE relname='theTable' and nspname = 'theSchema'
);
REINDEX TABLE theSchema.theTable;
The parallel level would be the number of concurrent Postgres connections,
while
partitions would be the resulting number of CSV files in HDFS.
This will be useful for large multiline CSVs, because they will be split into many
small CSV files instead of a few large ones.
The goal is to let Spark distinguish the changes, insert new rows, and update rows that differ. Steps are:
This limits the computation done by PostgreSQL.
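One way such change detection could work is to compare a per-row hash instead of every column. This is a sketch under stated assumptions, not the project's implementation: rows are keyed by an integer primary key, column values are already text, SHA-256 is an arbitrary choice of hash, and the names `row_hash` and `classify_rows` are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Sequence, Tuple


def row_hash(values: Sequence[Optional[str]]) -> str:
    """Hash a row's column values into one hex digest."""
    # A NULL marker and a separator that is unlikely to appear in the data,
    # so that ("ab", "c") and ("a", "bc") do not hash to the same value.
    parts = ["\\N" if v is None else v for v in values]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def classify_rows(
    source: Dict[int, Sequence[Optional[str]]],
    target_hashes: Dict[int, str],
) -> Tuple[List[int], List[int]]:
    """Return (keys to insert, keys to update) by comparing hashes."""
    inserts: List[int] = []
    updates: List[int] = []
    for key, values in source.items():
        if key not in target_hashes:
            inserts.append(key)      # row does not exist in the target yet
        elif target_hashes[key] != row_hash(values):
            updates.append(key)      # row exists but its content changed
    return inserts, updates
```

Only the key and the stored hash need to be read from the target table, which is what keeps the work on the database side small.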
Being able to run from SparkR.