Mock AWS S3 service for Spark or Hadoop component's unit test cases
We often write code that interacts with external services (AWS, databases). To test this code we could connect to those services and run our unit tests against them, but interacting with the real service during a test has undesired side effects such as additional cost and network latency.
The solution is to isolate unit tests from the cloud environment, and to minimize AWS storage and transfer costs, by mocking these managed services.
One case that comes up often is the need to mock AWS services, in particular S3.
S3Proxy provides a solution for this: it implements the S3 API and proxies requests, enabling S3 code to be tested against the local file system.
Setting up the mocking server
For Java/Scala projects we can include the S3Proxy dependency via Maven:
<dependency>
  <groupId>org.gaul</groupId>
  <artifactId>s3proxy</artifactId>
  <version>1.7.1</version>
</dependency>
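For Scala projects built with sbt, the equivalent declaration would look like the following (a sketch, assuming the same 1.7.1 version and test-only scope):

```scala
// build.sbt — S3Proxy as a test-only dependency (same coordinates as the Maven example)
libraryDependencies += "org.gaul" % "s3proxy" % "1.7.1" % Test
```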
Following is an example that configures the server and starts it on port 8088:
import java.net.URI;
import java.util.Properties;

import org.eclipse.jetty.util.component.AbstractLifeCycle;
import org.gaul.s3proxy.S3Proxy;
import org.jclouds.ContextBuilder;
import org.jclouds.blobstore.BlobStoreContext;

// Back the blob store with the local file system
Properties properties = new Properties();
properties.setProperty("jclouds.filesystem.basedir", "/tmp/blobstore");

BlobStoreContext context = ContextBuilder
    .newBuilder("filesystem")
    .credentials("identity", "credential")
    .overrides(properties)
    .build(BlobStoreContext.class);

S3Proxy s3Proxy = S3Proxy.builder()
    .blobStore(context.getBlobStore())
    .endpoint(URI.create("http://127.0.0.1:8088"))
    .build();

s3Proxy.start();
// Wait until the embedded Jetty server has fully started
while (!s3Proxy.getState().equals(AbstractLifeCycle.STARTED)) {
    Thread.sleep(1);
}
We can quickly check whether the server started properly with the commands below (the AccessDenied response shows the proxy is up and enforcing S3 authentication):
$ curl --request PUT http://localhost:8088/testbucket
$ curl http://localhost:8088/
<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>AWS authentication requires a valid Date or x-amz-date header</Message><RequestId>4442587FB7D0A2F9</RequestId></Error>
Mock S3 for Spark/Hadoop Components:
Hadoop/Spark applications use the Hadoop S3A client, so in order to use the mock S3 server we need to set some Hadoop configuration properties:
Configuration hadoopConf = session.sparkContext().hadoopConfiguration();
hadoopConf.set("fs.s3.impl", S3AFileSystem.class.getName());
hadoopConf.set("fs.s3a.endpoint", "http://localhost:8088");
hadoopConf.set("fs.s3a.access.key", "accessKey");
hadoopConf.set("fs.s3a.secret.key", "secretKey");
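One more setting worth understanding is fs.s3a.path.style.access (enabled in the test case below): with virtual-hosted-style addressing the bucket name becomes part of the hostname, which a plain localhost endpoint cannot resolve. A minimal sketch of the two URL styles (the objectUrl helper is purely illustrative, not part of the S3A client):

```java
// Illustrates S3 path-style vs virtual-hosted-style addressing.
// objectUrl is a hypothetical helper for demonstration only.
public class AddressingStyle {
    static String objectUrl(String endpoint, String bucket, String key, boolean pathStyle) {
        String[] parts = endpoint.split("://", 2); // e.g. ["http", "localhost:8088"]
        if (pathStyle) {
            // Bucket stays in the path: resolvable against a localhost endpoint
            return parts[0] + "://" + parts[1] + "/" + bucket + "/" + key;
        }
        // Bucket becomes a subdomain: "testbucket.localhost" will not resolve
        return parts[0] + "://" + bucket + "." + parts[1] + "/" + key;
    }

    public static void main(String[] args) {
        System.out.println(objectUrl("http://localhost:8088", "testbucket", "data.json", true));
        System.out.println(objectUrl("http://localhost:8088", "testbucket", "data.json", false));
    }
}
```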
Below is a sample JUnit test case for a simple Spark job using the S3Proxy configuration:
public class MyS3MockTest {

  AmazonS3 s3Client = null;

  @Rule
  public S3ProxyRule s3Proxy = S3ProxyRule.builder()
      .withPort(8088)
      .withCredentials("accessKey", "secretKey")
      .build();

  @Before
  public void initializeS3() {
    s3Client = AmazonS3ClientBuilder.standard()
        .withCredentials(new AWSStaticCredentialsProvider(
            new BasicAWSCredentials(s3Proxy.getAccessKey(),
                s3Proxy.getSecretKey())))
        .withEndpointConfiguration(
            new EndpointConfiguration(s3Proxy.getUri().toString(),
                Regions.US_EAST_1.getName()))
        .build();
    s3Client.createBucket("testbucket");
  }

  @Test
  public void mockS3Test() {
    SparkSession session = SparkSession.builder().master("local[*]")
        .getOrCreate();
    Configuration hadoopConf = session.sparkContext().hadoopConfiguration();
    hadoopConf.set("fs.s3.impl", S3AFileSystem.class.getName());
    hadoopConf.set("fs.s3a.endpoint", s3Proxy.getUri().toString());
    hadoopConf.set("fs.s3a.access.key", "accessKey");
    hadoopConf.set("fs.s3a.secret.key", "secretKey");
    hadoopConf.set("fs.s3a.attempts.maximum", "1");
    hadoopConf.set("fs.s3a.path.style.access", "true");
    hadoopConf.set("fs.s3a.multiobjectdelete.enable", "false");
    hadoopConf.set("fs.s3a.change.detection.version.required", "false");
    hadoopConf.set("fs.s3a.connection.establish.timeout", "20000");
    hadoopConf.set("fs.s3a.connection.timeout", "20000");

    JavaSparkContext jsc = JavaSparkContext
        .fromSparkContext(session.sparkContext());
    JavaRDD<String> jsonRDD = jsc.parallelize(Arrays.asList("val:1"));
    Dataset<Row> df = session.read().json(jsonRDD);
    df.write().json("s3://testbucket/myTestPath");

    // Spark writes output as a directory of part files plus a _SUCCESS marker;
    // there is no object at the bare path, so assert on the marker instead
    org.junit.Assert.assertTrue(
        s3Client.doesObjectExist("testbucket", "myTestPath/_SUCCESS"));
  }
}
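The S3ProxyRule used above ships in a separate JUnit helper artifact. Assuming Maven and the same version as before, the test dependency would be:

```xml
<dependency>
  <groupId>org.gaul</groupId>
  <artifactId>s3proxy-junit</artifactId>
  <version>1.7.1</version>
  <scope>test</scope>
</dependency>
```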
Data Pipeline using Kubernetes and Argo
What is Argo?
Argo is a Kubernetes CRD (Custom Resource Definition) that lets you run pipelines natively on your cluster. It allows you to create a directed acyclic graph of "steps" and execute them serially or in parallel, and it provides a UI for monitoring and running workflows.
Argo adds a new object to Kubernetes called a Workflow; using this Workflow object we can define the different steps of a pipeline.
Some of Argo’s features include:
- Conditional execution
- Artifacts
- Timeouts and retry logic
- Recursion and flow control
- Suspend, resume, and cancellation
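As an illustration of the retry feature, any template can declare a retryStrategy. A minimal sketch (the flaky-step name and limit value are placeholders, not from the original post):

```yaml
# Sketch: Argo re-runs this template up to 3 times if the container fails
- name: flaky-step
  retryStrategy:
    limit: 3
  container:
    image: alpine:latest
    command: [sh, -c]
    args: ["exit 1"]
```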
Installing Argo
kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/install.yaml
Example Workflow
Let's create a workflow in which the generate-parameter step produces some data and the consume-parameter step consumes the parameter generated by it.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-argo-wf
spec:
  entrypoint: output-parameter
  templates:
  - name: output-parameter
    steps:
    - - name: generate-parameter
        template: producer
    - - name: consume-parameter
        template: consumer
        arguments:
          parameters:
          - name: message
            value: "{{steps.generate-parameter.outputs.parameters.hello-param}}"
  - name: producer
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 1; echo -n hello world > /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello-param
        valueFrom:
          default: "Foobar"
          path: /tmp/hello_world.txt
  - name: consumer
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo {{inputs.parameters.message}}"]
We can submit the Argo workflow using kubectl:
kubectl apply -f my-argo-wf.yaml
Or we can use the Argo CLI:
argo submit my-argo-wf.yaml