uswitch / blueshift
Automate copying data from S3 into Amazon Redshift
License: Eclipse Public License 1.0
Great project. One concern about passwords: according to the design.md documentation, the manifest.edn file containing the JDBC connection username and password sits in S3.
Shouldn't this information be better placed in the config.edn file?
-A
So over the weekend, S3's eventual consistency on list operations became really slow, and it seems the S3 list index was returning files that had actually been deleted, so I was getting COPY errors.
So I think I'll need to run through the list and do a HEAD request for each file, to make sure it's 100% there before adding it to the manifest, correct?
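A minimal sketch of that HEAD check using the AWS Java SDK's getObjectMetadata (which issues a HEAD request under the hood); the helper names here (object-exists?, verified-keys) are illustrative, not Blueshift's own:

```clojure
;; Sketch only: assumes the AWS Java SDK v1 is on the classpath.
(ns example.s3-check
  (:import [com.amazonaws.services.s3 AmazonS3Client]
           [com.amazonaws AmazonServiceException]))

(defn object-exists?
  "HEAD the object via getObjectMetadata; false on 404, rethrow otherwise."
  [^AmazonS3Client client ^String bucket ^String key]
  (try
    (.getObjectMetadata client bucket key)
    true
    (catch AmazonServiceException e
      (if (= 404 (.getStatusCode e))
        false
        (throw e)))))

(defn verified-keys
  "Drop listed keys that no longer exist before they reach the manifest."
  [client bucket keys]
  (doall (filter #(object-exists? client bucket %) keys)))
```

Filtering the listing this way would avoid handing COPY a manifest entry for a key that the eventually-consistent list index has already stopped backing.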
select recordtime, query, sliceid, substring(key from 1 for 20) as file,
       substring(error from 1 for 300) as error, bucket, key
from stl_s3client_error;
Sample row:
2015-04-27 20:12:18.131621 | 1498948 | 31 | staging/y=2015_m=4_d | s3-us-west-2.amazonaws.com 54.231.161.48 S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid C61A58E48CCA899D,ExtRid p5mfdOb0otnryj7yxur/iM/TS9RI+DfjqA58WL5GSlMZhEJsPJENIvHDRUFfCEkEA7VkQ7frq9U=,CanRetry 1,Abort: 5 5 5. | datasnap-events-prod-flattened-redshift-04-13-2015 | staging/y=2015_m=4_d=27_H=18_M=15_80deb2ee-2bc9-4c44-9a4f-b5f4b1e07e87.gz
When an import fails, query the stl_load_errors table for the errors during that transaction and store them in a file on S3, so it's easier to see when/why imports don't process. Failures are most likely due to a schema problem or a problem with the data itself.
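For reference, a query along these lines would surface the most recent load errors (columns taken from Redshift's documented stl_load_errors system table):

```sql
-- Most recent COPY failures, newest first.
select starttime, filename, line_number, colname, err_code, err_reason
from stl_load_errors
order by starttime desc
limit 10;
```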
We have a continuous stream of events coming into our API, placing small numbers of files in the S3 bucket, and they are being processed fine.
Sometimes we import a very large number of smaller files from customers. When we do this, we see a large manifest file being created and placed at the root location, but as the continuous stream of events keeps coming we notice that the initial COPY ANALYZE gets aborted, and Blueshift just starts importing the small files in a separate manifest, sort of "forgetting" about the large one. Only when I restart Blueshift does it reread the folder, create a new manifest, and kick off the proper COPY.
Is Blueshift somehow aborting the large copy when it processes another smaller manifest, and then forgetting about it?
I am going to dig into the code a little myself and see if I can follow what's going on there.
Sometimes after restarting it will run the COPY and then it aborts:
https://www.dropbox.com/s/a1qdmnc6dsnyjfk/Screenshot%202015-02-03%2012.14.06.png?dl=0
Ensure Blueshift handles manifests that fail validation. It should log the error and try again after the usual delay.
clojure.lang.ExceptionInfo: Invalid manifest. Check map for more details.
at clojure.core$ex_info.invoke(core.clj:4403)
at uswitch.blueshift.s3$validate.invoke(s3.clj:26)
at uswitch.blueshift.s3.KeyWatcher$fn__8238$fn__8239.invoke(s3.clj:98)
at uswitch.blueshift.s3.KeyWatcher$fn__8238.invoke(s3.clj:95)
at clojure.core.async$thread_call$fn__5297$fn__5298.invoke(async.clj:390)
at clojure.core.async$thread_call$fn__5297.invoke(async.clj:390)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Blueshift's default strategy is to merge new data with the existing data. This follows Amazon's recommended method (see http://docs.aws.amazon.com/redshift/latest/dg/merge-examples.html for examples).
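The pattern from that AWS page, sketched here with placeholder target/staging table names and a placeholder id join column:

```sql
begin transaction;

-- Remove existing rows that reappear in the staging table...
delete from target
using staging
where target.id = staging.id;

-- ...then insert everything from staging.
insert into target
select * from staging;

end transaction;
```

Running both statements in one transaction keeps readers from seeing the table between the delete and the re-insert.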
The delete statement seems to be very slow when run against tables whose primary keys aren't also sort or distribution keys.
We have a lot of small S3 files, and the delete operation in Blueshift is the biggest bottleneck right now.
Any suggestions? I don't know the implications of using something like pmap offhand.
(defrecord Cleaner [credentials bucket cleaner-ch]
  Lifecycle
  (start [this]
    (thread
      (loop []
        (when-let [m (<!! cleaner-ch)]
          (doseq [key (:files m)]
            (info "Deleting" (str "s3://" bucket "/" key))
            (delete-object credentials bucket key))
          (recur))))
    this)
  (stop [this]
    this))
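A sketch of the pmap idea mentioned above: replace the sequential doseq over the manifest's files with pmap, forced with dorun. This is hypothetical and untested; delete-object, credentials, bucket and the info logger are assumed from the surrounding Cleaner, and pmap's fixed parallelism (roughly CPU count + 2) may or may not suit an I/O-bound workload like S3 deletes.

```clojure
;; Hypothetical parallel variant of Cleaner's delete loop.
;; dorun forces the lazy pmap sequence purely for its side effects.
(defn delete-files! [credentials bucket files]
  (dorun
   (pmap (fn [key]
           (info "Deleting" (str "s3://" bucket "/" key))
           (delete-object credentials bucket key))
         files)))
```

If the deletes are dominated by network latency rather than CPU, a fixed thread pool or core.async pipeline with an explicit concurrency limit might be a better fit than pmap.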
I had this software working a few weeks ago, but now it does not work.
The S3 bucket scanner finds my two directories with manifest.edn files, but does not add them to the "watched directories" list. The list stays at length 0, and no files are uploaded to Postgres (Redshift).
The DEBUG printout shows that it finds the manifest.edn file. There is no printout along the lines of "I found a manifest.edn file but did not like it".
The JDBC URL that appears in Redshift's dashboard starts with jdbc:redshift://... . However, Blueshift expects jdbc:postgresql://... in the manifest file.
When using the former, Blueshift raises this error when trying to connect to Redshift:
Error loading into blueshift java.sql.SQLException: No suitable driver found for jdbc:redshift://...
This is because Blueshift uses the PostgreSQL driver instead of the Redshift driver.
I think this is a common problem that users might hit, so it may be a good idea to check for it during manifest validation and alert the user with a custom error instead of raising the generic SQLException.
What do you think?
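A minimal sketch of such a check, as it might slot into manifest validation; the :jdbc-url key name and the function itself are assumptions, not Blueshift's actual code:

```clojure
;; Hypothetical manifest check: fail fast with a helpful message when the
;; manifest carries a URL the bundled PostgreSQL driver can't open.
(defn check-jdbc-url
  [{:keys [jdbc-url] :as manifest}]
  (when-not (and jdbc-url
                 (.startsWith ^String jdbc-url "jdbc:postgresql://"))
    (throw (ex-info (str "Manifest JDBC URL must start with jdbc:postgresql:// "
                         "(Redshift's dashboard shows jdbc:redshift://, which "
                         "the PostgreSQL driver cannot handle): " jdbc-url)
                    {:jdbc-url jdbc-url})))
  manifest)
```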
I seem to get these a lot lately, since our app is running a lot of SELECT statements now, and they always hit the same table at the same time that Blueshift is loading data into it.
org.postgresql.util.PSQLException: ERROR: 1023 Detail: Serializable isolation violation on table
I have an open ticket with Amazon and see suggestions about taking an exclusive lock on the table, but I noticed Blueshift is not doing that. Does it need to, or am I missing something?
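For reference, the locking approach Amazon tends to suggest wraps the COPY in a transaction that takes an explicit table lock first; the events table name, S3 path, and IAM role below are placeholders:

```sql
begin;

-- Takes an exclusive lock, blocking concurrent readers and writers
-- on this table until the transaction commits.
lock events;

copy events
from 's3://my-bucket/my-manifest'
credentials 'aws_iam_role=arn:aws:iam::123456789012:role/my-role'
manifest;

end;
```

This serialises the load against concurrent SELECTs, trading query latency during the load for the isolation violations going away.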
I'm getting the following error while importing into Redshift, although it seems that all the data was imported fine.
Any clues why this happens?
[14:30:24.130] ERROR uswitch.blueshift.redshift - ROLLBACK
java.sql.SQLException: ERROR: Mandatory url is not present in manifest file.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1501)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:186)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:392)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:330)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:321)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
at uswitch.blueshift.redshift$execute$fn__9063.invoke(redshift.clj:103)
at uswitch.blueshift.redshift$execute.doInvoke(redshift.clj:103)
at clojure.lang.RestFn.invoke(RestFn.java:482)
at uswitch.blueshift.redshift$merge_table.invoke(redshift.clj:117)
at uswitch.blueshift.redshift$load_table.invoke(redshift.clj:138)
at uswitch.blueshift.redshift.Loader$fn__9111$fn__9114$fn__9115.invoke(redshift.clj:160)
at uswitch.blueshift.redshift.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
at com.codahale.metrics.Timer.time(Timer.java:99)
at uswitch.blueshift.redshift.Loader$fn__9111$fn__9114.invoke(redshift.clj:159)
at uswitch.blueshift.redshift.Loader$fn__9111.invoke(redshift.clj:159)
at clojure.core.async$thread_call$fn__5301$fn__5302.invoke(async.clj:390)
at clojure.core.async$thread_call$fn__5301.invoke(async.clj:390)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[14:30:24.292] ERROR uswitch.blueshift.redshift - Error loading into CM_rev
java.sql.SQLException: ERROR: Mandatory url is not present in manifest file.
    (stack trace identical to the one above)
Wanted to get your thoughts.
In my use case, I actually need two usage modes: one is like yours, which I affectionately term Upsert/Clobber mode, and the other I term Upsert/Storage mode. I'm open to other names as well.
The current behavior is the Clobber mode. It watches S3 buckets that match the patterns in the config and bucket manifest, performs the upsert when file changes are found, and then deletes the S3 files, leaving the manifest untouched.
Storage mode would be designed for S3 buckets that keep semi-permanent data, and would do something entirely different. It would also watch S3 buckets that match the patterns in the config and bucket manifest and perform the upsert, but instead of deleting the S3 data files it would delete the bucket's manifest (upon successful data load into Redshift), so that the bucket is no longer watched and the data is not loaded again. This is useful for loading data buckets that contain dated subfolders (e.g. 2015-01-13) with the files for that date therein: the data is loaded into Redshift without being deleted, and need not be loaded again.
Other modes could potentially be created if necessary or desired.
I haven't implemented this yet, but am considering it.
My questions are:
1. What are your thoughts on this in general?
2. Do you desire this kind of behavior (or something similar)?
3. Would you eventually want something like this merged back into Blueshift?
4. What are your thoughts around semantics of how to specify these Modes? Config file semantics? Manifest file semantics?
Thanks for your thoughts.
-Avram
I'm on Java 7 (1.7.0_79) on Linux Ubuntu 14.04 (Mint). I built Blueshift a few weeks ago and set it up. It ran fine.
But now it does not. At startup I get this:
[2015-08-28 20:36:41,430] INFO uswitch.blueshift.main - Starting Blueshift with configuration elb.config.edn
[2015-08-28 20:36:41,447] INFO uswitch.blueshift.telemetry - Loading reporter uswitch.blueshift.telemetry/log-metrics-reporter for registry #<MetricRegistry com.codahale.metrics.MetricRegistry@c909572>
[2015-08-28 20:36:41,474] INFO uswitch.blueshift.s3 - Starting BucketWatcher. Polling edmodo.hadoop.export every 30 seconds for keys matching #"elblogs.*"
[2015-08-28 20:36:41,480] INFO uswitch.blueshift.s3 - Starting KeyWatcherSpawner
[2015-08-28 20:36:41,489] INFO uswitch.blueshift.telemetry - Starting Slf4j metrics reporter
This happens without any JVM stack memory arguments; I have also tested various combinations of memory arguments. Here is a sample startup line:
java -Dlogback.configurationFile=./etc/logback.xml -Xss325k -Xmx10g -jar target/blueshift-0.1.0-SNAPSHOT-standalone.jar --config elb.config.edn
Do you have a time-tested combination of memory flags that will support 5-10 polled directories?
(I am filing an issue as there is no mailing list/google groups etc.)
If I have multiple unimported CSV files, and each contains a change to the same record, that record is written multiple times to the database. Is there a way to tell Blueshift to import file by file?
I am getting this error sometimes when running Blueshift. It loads data "sometimes", but when it gets this error it does not, BUT it still removes the data files from the bucket, so the data never gets loaded.
Can someone help me troubleshoot this, or point me at how to get more forensic info?
Thanks!
_rules_time_trigger_sunday,campaign_targeting_rules_communication_ids_idx_0,
campaign_targeting_rules_communication_ids_idx_1,campaign_targeting_rules_communication_ids_idx_2,
campaign_targeting_rules_communication_ids_idx_3,campaign_targeting_rules_communication_ids_idx_4,
datasnap_timestamp,datasnap_created,datasnap_txn_id,datasnap_customer_txn_id,
datasnap_device_user_agent,datasnap_device_ip_address,datasnap_device_platform,
datasnap_device_os_version,datasnap_device_model,datasnap_device_manufacturer,
datasnap_device_name,datasnap_device_vendor_id,datasnap_device_carrier_name,
datasnap_device_country_code,datasnap_device_network_code)
FROM 's3://datasnap-events-staging-latest-flattened-redshift/c954b50a-29d4-4ce7-a623-91afe752e707.manifest'
CREDENTIALS 'aws_access_key_id=AKIAJLGOZADWN3CH5NSA;aws_secret_access_key=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'
TIMEFORMAT 'auto' EMPTYASNULL BLANKSASNULL manifest
[async-thread-macro-3] ERROR uswitch.blueshift.redshift - ROLLBACK
java.sql.SQLException: ERROR: Mandatory url is not present in manifest file.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1501)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:186)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:392)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:330)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:321)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
    at uswitch.blueshift.redshift$execute$fn__10871.invoke(redshift.clj:103)
    at uswitch.blueshift.redshift$execute.doInvoke(redshift.clj:103)
    at clojure.lang.RestFn.invoke(RestFn.java:482)
    at uswitch.blueshift.redshift$merge_table.invoke(redshift.clj:117)
    at uswitch.blueshift.redshift$load_table.invoke(redshift.clj:138)
    at uswitch.blueshift.redshift.Loader$fn__10919$fn__10922$fn__10923.invoke(redshift.clj:160)
    at uswitch.blueshift.redshift.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:99)
    at uswitch.blueshift.redshift.Loader$fn__10919$fn__10922.invoke(redshift.clj:159)
    at uswitch.blueshift.redshift.Loader$fn__10919.invoke(redshift.clj:159)
    at clojure.core.async$thread_call$fn__6407$fn__6408.invoke(async.clj:390)
    at clojure.core.async$thread_call$fn__6407.invoke(async.clj:390)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[async-thread-macro-3] ERROR uswitch.blueshift.redshift - Error loading into raw_events_final_temp
java.sql.SQLException: ERROR: Mandatory url is not present in manifest file.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1501)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:186)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:392)