Mock AWS S3 service for Spark or Hadoop component's unit test cases
We often write code that interacts with external services (AWS, databases). To test this code we could connect to those services and run our unit tests against them, but interacting with the real service during a test has undesired side effects such as additional cost and network latency.
The solution is to isolate unit tests from the cloud environment, and to minimize AWS storage and transfer costs, by mocking these managed services.
One case that comes up often is the need to mock AWS services, in particular S3.
S3Proxy provides a solution for this: it implements the S3 API and proxies requests, enabling S3 code to be tested against the local file system.
Setting up the mocking server
For Java/Scala projects we can include the S3Proxy dependency via Maven:
<dependency>
  <groupId>org.gaul</groupId>
  <artifactId>s3proxy</artifactId>
  <version>1.7.1</version>
</dependency>
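For Scala projects built with sbt, the equivalent declaration would look like the following (a sketch, assuming the same 1.7.1 version and test-only scope):

```scala
// build.sbt — S3Proxy as a test-only dependency (same coordinates as the Maven example)
libraryDependencies += "org.gaul" % "s3proxy" % "1.7.1" % Test
```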
Following is an example that configures the server and starts it on port 8088:
import java.net.URI;
import java.util.Properties;

import org.eclipse.jetty.util.component.AbstractLifeCycle;
import org.gaul.s3proxy.S3Proxy;
import org.jclouds.ContextBuilder;
import org.jclouds.blobstore.BlobStoreContext;

// Back the blob store with the local file system
Properties properties = new Properties();
properties.setProperty("jclouds.filesystem.basedir", "/tmp/blobstore");

BlobStoreContext context = ContextBuilder
    .newBuilder("filesystem")
    .credentials("identity", "credential")
    .overrides(properties)
    .build(BlobStoreContext.class);

S3Proxy s3Proxy = S3Proxy.builder()
    .blobStore(context.getBlobStore())
    .endpoint(URI.create("http://127.0.0.1:8088"))
    .build();

s3Proxy.start();
// Wait until the embedded Jetty server has fully started
while (!s3Proxy.getState().equals(AbstractLifeCycle.STARTED)) {
    Thread.sleep(1);
}
We can quickly check whether the server started properly with the commands below (the AccessDenied response shows the proxy is up and enforcing S3 authentication):
$ curl --request PUT http://localhost:8088/testbucket
$ curl http://localhost:8088/
<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>AWS authentication requires a valid Date or x-amz-date header</Message><RequestId>4442587FB7D0A2F9</RequestId></Error>
Mock S3 for Spark/Hadoop Components:
Hadoop/Spark applications use the Hadoop S3A client, so in order to use the mock S3 server we need to set some Hadoop configuration properties:
Configuration hadoopConf = session.sparkContext().hadoopConfiguration();
hadoopConf.set("fs.s3.impl", S3AFileSystem.class.getName());
hadoopConf.set("fs.s3a.endpoint", "http://localhost:8088");
hadoopConf.set("fs.s3a.access.key", "accessKey");
hadoopConf.set("fs.s3a.secret.key", "secretKey");
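One more setting worth understanding is fs.s3a.path.style.access (enabled in the test case below): with virtual-hosted-style addressing the bucket name becomes part of the hostname, which a plain localhost endpoint cannot resolve. A minimal sketch of the two URL styles (the objectUrl helper is purely illustrative, not part of the S3A client):

```java
// Illustrates S3 path-style vs virtual-hosted-style addressing.
// objectUrl is a hypothetical helper for demonstration only.
public class AddressingStyle {
    static String objectUrl(String endpoint, String bucket, String key, boolean pathStyle) {
        String[] parts = endpoint.split("://", 2); // e.g. ["http", "localhost:8088"]
        if (pathStyle) {
            // Bucket stays in the path: resolvable against a localhost endpoint
            return parts[0] + "://" + parts[1] + "/" + bucket + "/" + key;
        }
        // Bucket becomes a subdomain: "testbucket.localhost" will not resolve
        return parts[0] + "://" + bucket + "." + parts[1] + "/" + key;
    }

    public static void main(String[] args) {
        System.out.println(objectUrl("http://localhost:8088", "testbucket", "data.json", true));
        System.out.println(objectUrl("http://localhost:8088", "testbucket", "data.json", false));
    }
}
```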
Below is a sample JUnit test case for a simple Spark job using the S3Proxy configuration:
public class MyS3MockTest {

  AmazonS3 s3Client = null;

  @Rule
  public S3ProxyRule s3Proxy = S3ProxyRule.builder()
      .withPort(8088)
      .withCredentials("accessKey", "secretKey")
      .build();

  @Before
  public void initializeS3() {
    s3Client = AmazonS3ClientBuilder.standard()
        .withCredentials(new AWSStaticCredentialsProvider(
            new BasicAWSCredentials(s3Proxy.getAccessKey(),
                s3Proxy.getSecretKey())))
        .withEndpointConfiguration(
            new EndpointConfiguration(s3Proxy.getUri().toString(),
                Regions.US_EAST_1.getName()))
        .build();
    s3Client.createBucket("testbucket");
  }

  @Test
  public void mockS3Test() {
    SparkSession session = SparkSession.builder().master("local[*]")
        .getOrCreate();
    Configuration hadoopConf = session.sparkContext().hadoopConfiguration();
    hadoopConf.set("fs.s3.impl", S3AFileSystem.class.getName());
    hadoopConf.set("fs.s3a.endpoint", s3Proxy.getUri().toString());
    hadoopConf.set("fs.s3a.access.key", "accessKey");
    hadoopConf.set("fs.s3a.secret.key", "secretKey");
    hadoopConf.set("fs.s3a.attempts.maximum", "1");
    hadoopConf.set("fs.s3a.path.style.access", "true");
    hadoopConf.set("fs.s3a.multiobjectdelete.enable", "false");
    hadoopConf.set("fs.s3a.change.detection.version.required", "false");
    hadoopConf.set("fs.s3a.connection.establish.timeout", "20000");
    hadoopConf.set("fs.s3a.connection.timeout", "20000");

    JavaSparkContext jsc = JavaSparkContext
        .fromSparkContext(session.sparkContext());
    JavaRDD<String> jsonRDD = jsc.parallelize(Arrays.asList("val:1"));
    Dataset<Row> df = session.read().json(jsonRDD);
    df.write().json("s3://testbucket/myTestPath");

    // Spark writes output as a directory of part files plus a _SUCCESS marker;
    // there is no object at the bare path, so assert on the marker instead
    org.junit.Assert.assertTrue(
        s3Client.doesObjectExist("testbucket", "myTestPath/_SUCCESS"));
  }
}
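The S3ProxyRule used above ships in a separate JUnit helper artifact. Assuming Maven and the same version as before, the test dependency would be:

```xml
<dependency>
  <groupId>org.gaul</groupId>
  <artifactId>s3proxy-junit</artifactId>
  <version>1.7.1</version>
  <scope>test</scope>
</dependency>
```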
Data Pipeline using Kubernetes and Argo
What is Argo?
Argo is a Kubernetes CRD (Custom Resource Definition) that lets you run pipelines natively on your cluster. It allows you to create a directed acyclic graph of "steps" and execute them serially or in parallel, and it provides a UI for monitoring and running workflows.
Argo adds a new object to Kubernetes called a Workflow; using this Workflow object we can define the different steps of a pipeline.
Some of Argo’s features include:
- Conditional execution
- Artifacts
- Timeouts and retry logic
- Recursion and flow control
- Suspend, resume, and cancellation
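As an illustration of the retry feature, any template can declare a retryStrategy. A minimal sketch (the flaky-step name and limit value are placeholders, not from the original post):

```yaml
# Sketch: Argo re-runs this template up to 3 times if the container fails
- name: flaky-step
  retryStrategy:
    limit: 3
  container:
    image: alpine:latest
    command: [sh, -c]
    args: ["exit 1"]
```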
Installing Argo
kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/install.yaml
Example Workflow
Let's create a workflow in which the generate-parameter step produces some data and the consume-parameter step consumes the parameter generated by it.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-argo-wf
spec:
  entrypoint: output-parameter
  templates:
  - name: output-parameter
    steps:
    - - name: generate-parameter
        template: producer
    - - name: consume-parameter
        template: consumer
        arguments:
          parameters:
          - name: message
            value: "{{steps.generate-parameter.outputs.parameters.hello-param}}"
  - name: producer
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 1; echo -n hello world > /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello-param
        valueFrom:
          default: "Foobar"
          path: /tmp/hello_world.txt
  - name: consumer
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo {{inputs.parameters.message}}"]
We can submit the Argo workflow using kubectl:
kubectl apply -f my-argo-wf.yaml
Or we can use the Argo CLI:
argo submit my-argo-wf.yaml