
Mock AWS S3 service for Spark or Hadoop components' unit test cases

We often write code that interacts with external services (AWS, databases). To test this code we could connect to those services and run our unit tests against them, but interacting with the actual service during a test has undesired side effects such as additional cost and network latency.

The solution is to isolate the unit tests from the cloud environment, and to minimize AWS storage and transfer costs, by mocking these managed services.

One case that often comes up for me is the need to mock AWS services, in particular S3.

S3Proxy provides a solution for this. It implements the S3 API and proxies requests, so S3 code can be tested against the local file system.

Setting up the mocking server

For Java/Scala projects we can include the S3Proxy dependency via Maven:

<dependency>
    <groupId>org.gaul</groupId>
    <artifactId>s3proxy</artifactId>
    <version>1.7.1</version>
</dependency>

The following example configures the server and starts it on port 8088:

// Back the mock S3 blob store with a directory on the local file system.
Properties properties = new Properties();
properties.setProperty("jclouds.filesystem.basedir", "/tmp/blobstore");

// Build a jclouds filesystem-backed BlobStoreContext with dummy credentials.
BlobStoreContext context = ContextBuilder
        .newBuilder("filesystem")
        .credentials("identity", "credential")
        .overrides(properties)
        .build(BlobStoreContext.class);

// Expose the blob store through the S3 API on localhost:8088.
S3Proxy s3Proxy = S3Proxy.builder()
        .blobStore(context.getBlobStore())
        .endpoint(URI.create("http://127.0.0.1:8088"))
        .build();

// Start the proxy and wait until the embedded Jetty server
// (AbstractLifeCycle comes from org.eclipse.jetty) reports STARTED.
s3Proxy.start();
while (!s3Proxy.getState().equals(AbstractLifeCycle.STARTED)) {
    Thread.sleep(1);
}
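
Once the tests are done, the proxy and the backing blob store context should be shut down so the port and the temporary directory are released. A minimal tear-down sketch, assuming the s3Proxy and context objects from the snippet above:

// Tear-down sketch, e.g. in an @AfterClass method or a finally block.
s3Proxy.stop();
context.close();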

We can quickly check whether the server started properly with the commands below. Even the AccessDenied error confirms that the proxy is up and speaking the S3 protocol on port 8088; the anonymous curl request is rejected simply because it lacks the AWS authentication headers a signed request would carry.

$ curl --request PUT http://localhost:8088/testbucket
$ curl http://localhost:8088/
<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>AWS authentication requires a valid Date or x-amz-date header</Message><RequestId>4442587FB7D0A2F9</RequestId></Error>

Mock S3 for Spark/Hadoop Components

Hadoop and Spark applications use the Hadoop S3A client, so in order to point them at the mock S3 server we need to set some Hadoop configuration:

Configuration hadoopConf = session.sparkContext().hadoopConfiguration();

// Route s3:// paths through the S3A connector and point it at the local mock server.
hadoopConf.set("fs.s3.impl", S3AFileSystem.class.getName());
hadoopConf.set("fs.s3a.endpoint", "http://localhost:8088");
hadoopConf.set("fs.s3a.access.key", "accessKey");
hadoopConf.set("fs.s3a.secret.key", "secretKey");

Below is a sample JUnit test case for a simple Spark job using the S3Proxy configuration. The S3ProxyRule JUnit rule starts the proxy before each test and stops it afterwards.

import java.util.Arrays;

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.s3a.S3AFileSystem;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.gaul.s3proxy.junit.S3ProxyRule;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;

public class MyS3MockTest {

    AmazonS3 s3Client = null;

    // Starts an S3Proxy instance on port 8088 before each test and stops it afterwards.
    @Rule
    public S3ProxyRule s3Proxy = S3ProxyRule.builder().withPort(8088)
            .withCredentials("accessKey", "secretKey").build();

    @Before
    public void initializeS3() {
        // Plain AWS SDK client pointed at the local proxy, used to create the
        // bucket and to verify what the Spark job wrote.
        s3Client = AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials(s3Proxy.getAccessKey(),
                                s3Proxy.getSecretKey())))
                .withEndpointConfiguration(
                        new EndpointConfiguration(s3Proxy.getUri().toString(),
                                Regions.US_EAST_1.getName()))
                .build();
        s3Client.createBucket("testbucket");
    }

    @Test
    public void mockS3Test() {
        SparkSession session = SparkSession.builder().master("local[*]")
                .getOrCreate();

        // Point the S3A connector at the mock server instead of the real AWS endpoint.
        Configuration hadoopConf = session.sparkContext().hadoopConfiguration();
        hadoopConf.set("fs.s3.impl", S3AFileSystem.class.getName());
        hadoopConf.set("fs.s3a.endpoint", s3Proxy.getUri().toString());
        hadoopConf.set("fs.s3a.access.key", "accessKey");
        hadoopConf.set("fs.s3a.secret.key", "secretKey");

        // Settings that make S3A behave well against a local, non-AWS endpoint.
        hadoopConf.set("fs.s3a.attempts.maximum", "1");
        hadoopConf.set("fs.s3a.path.style.access", "true");
        hadoopConf.set("fs.s3a.multiobjectdelete.enable", "false");
        hadoopConf.set("fs.s3a.change.detection.version.required", "false");
        hadoopConf.set("fs.s3a.connection.establish.timeout", "20000");
        hadoopConf.set("fs.s3a.connection.timeout", "20000");

        // Write a tiny JSON dataset to the mock bucket.
        JavaSparkContext jsc = JavaSparkContext
                .fromSparkContext(session.sparkContext());
        JavaRDD<String> jsonRDD = jsc.parallelize(Arrays.asList("{\"val\": 1}"));
        Dataset<Row> df = session.read().json(jsonRDD);
        df.write().json("s3://testbucket/myTestPath");

        // Spark writes a directory of part files under the given path, so assert
        // that the output prefix is non-empty rather than checking a single key.
        Assert.assertTrue(s3Client
                .listObjectsV2("testbucket", "myTestPath/").getKeyCount() > 0);
    }
}
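
The S3ProxyRule used above comes from S3Proxy's JUnit helper module, which is a separate artifact. Assuming Maven, the test dependency would look roughly like this (same version as the s3proxy dependency above):

<dependency>
    <groupId>org.gaul</groupId>
    <artifactId>s3proxy-junit</artifactId>
    <version>1.7.1</version>
    <scope>test</scope>
</dependency>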

Similar Projects to S3Proxy

  • fake s3
  • S3Mock
  • mock-s3

Data Pipeline using Kubernetes and Argo

What is Argo?

Argo is a Kubernetes CRD that allows you to run pipelines natively on your cluster. It lets you create a directed acyclic graph of "steps" and execute them serially or in parallel, and it provides a UI for monitoring and running workflows.
Argo adds a new object to Kubernetes called a Workflow; using this Workflow object we define the different steps.

Some of Argo’s features include:

  • Conditional execution
  • Artifacts
  • Timeouts and retry logic
  • Recursion and flow control
  • Suspend, resume, and cancellation
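
For example, retries and timeouts can be declared directly on a workflow template. Below is a minimal sketch of such a template fragment (the template name, image and values are illustrative, not part of the example later in this post):

  # Fragment of a Workflow's templates list, sketching retry and timeout fields.
  - name: retriable-step
    retryStrategy:
      limit: 3                   # retry the container up to 3 times on failure
    activeDeadlineSeconds: 120   # fail the step if it runs longer than 2 minutes
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo this step can be retried"]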

Installing Argo

kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/install.yaml

Example Workflow

Let's create a workflow where the generate-parameter step produces some data and the consume-parameter step consumes the parameter generated by generate-parameter.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: my-argo-wf
spec:
  entrypoint: output-parameter
  templates:
  - name: output-parameter
    steps:
    - - name: generate-parameter
        template: producer
    - - name: consume-parameter
        template: consumer
        arguments:
          parameters:
          - name: message
            value: "{{steps.generate-parameter.outputs.parameters.hello-param}}"

  - name: producer
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 1; echo -n hello world > /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello-param
        valueFrom:
          default: "Foobar"
          path: /tmp/hello_world.txt

  - name: consumer
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo {{inputs.parameters.message}}"]

We can submit the Argo workflow using kubectl:

kubectl apply -f my-argo-wf.yaml

Or we can use the Argo CLI:

argo submit my-argo-wf.yaml
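
After submitting, the workflow's progress can also be followed from the CLI; a few commonly used commands (run them in the namespace the workflow was submitted to):

argo list                 # list workflows in the current namespace
argo get my-argo-wf       # show the workflow's status and step tree
argo watch my-argo-wf     # follow the workflow until it completes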

Finally, the submitted workflow and its steps can also be viewed in the Argo UI.
