
Review of Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow

(Book cover: Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd edition)

Conclusion first: this is the best book at the moment (2019) on machine learning engineering practice, not only for beginners but also for veterans. Although the copy I read is only the early-release edition, it is already impressive. I will buy the final edition without hesitation when it is released.

As we all know, the most important part of learning computer science and machine learning is practice. It is also the hardest part, especially for beginners. The beginners mentioned here are not the green hands who know nothing about CS or ML theory: it can also be very difficult for experienced developers and researchers when they start to get into a new topic or try a new framework, because it usually involves a lot of tricks, jargon and unfamiliar patterns. That is why there are so many books with "hands-on" and "in practice" in the title. At the same time, not many of them are worth reading. Obviously, Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow is one of the real masterpieces among these "hands-on" series.

This 2nd edition inherits most of its content from the previous edition (Hands-on Machine Learning with Scikit-Learn & TensorFlow). You can find all the differences listed in the preface of the book, which is very helpful for readers who already own the 1st edition. All code from the book is hosted on GitHub, and the author keeps updating and patching it. For experienced machine learning veterans who only care about practical skills such as parameter tuning, model selection, framework usage, etc., these up-to-date Jupyter notebooks are a true treasure: all the details are there, and you can get a rough picture of how modern ML works no matter how little you know about ML & DL theory. Compared with this book, more than 90% of the other "hands-on" books are just trash in terms of completeness and originality. The author also gives many hints and easy-to-understand explanations throughout the book. I have to say these suggestions are highly valuable for avoiding pitfalls. I guess many of you will share this feeling, having spent so much time on exhausting bug fixing while playing with scikit-learn, Keras and TensorFlow. Considering the time it saves, this book is simply too cheap!

Of course, this book is not a good textbook for the theoretical details of machine learning and deep learning. If you want to go deeper into the theory, other classic books are no doubt much better, like Deep Learning by Ian Goodfellow, Machine Learning by ZHOU Zhihua and so on. Personally, I recommend cross-referencing these materials as you read them. Effective learning is always a spiral between theory and practice.

Review of Designing Data-Intensive Applications (DDIA)

(Book cover: Designing Data-Intensive Applications)

This book has become a must-read since its first publication in 2017, and it is indeed well worth reading. I give it five stars and recommend every programmer to have a look at it, whether a beginner or a senior.

DDIA is an easy book to read because of its detailed explanation of fundamental concepts. But it is still hard to understand the content in depth, because following the author's reasoning from a practical problem-solving perspective requires rich engineering experience. Therefore, there are two quite different ways to read the book. For beginners, it is recommended to simply go through all the chapters; it is important to know the basics about databases, distributed systems and other concepts. Personally, I like the second part (Distributed Data) a lot. How to partition big data and how to keep data consistent across partitions are actually a big part of my daily work. And surprisingly, I keep meeting programmers (even very senior ones) who don't really know basic concepts such as transactions, isolation levels, 2PL, MVCC, etc. So I highly recommend taking time to read this part. For senior and experienced developers, DDIA is a very good reference: you can start from the last chapter (The Future of Data Systems) and jump into the details whenever you feel unfamiliar with some idea. The references after each chapter's summary are also very good if you need to go deeper into a particular topic.

Martin Kleppmann strongly advocates an architecture based on derived data, message passing, and stateless processors. It is also a de facto best practice found everywhere in medium- and large-scale backend applications. The reason is simple and mentioned dozens of times in the book: it is very hard and unpredictable to keep a system efficient and reliable when it is inevitably split into small pieces (microservices). Similar ideas are implemented in different solutions and architectures. From my point of view, the most important thing to learn from the book, and to remember, is to always keep in mind the fundamental data-processing concepts and the trade-offs between different technologies.

CI/CD Pipeline Using GitLab and Rancher

We have been talking about CI/CD for years. And like other "everyone knows but no one knows how" concepts, a lot of practices, procedures, talks and useless slides have been published, but few are useful. I don't want to complain that so many people use the concept before they really know what it is and how it works. I have to say that, in most cases, they simply failed to find the appropriate tools and "savoir-faire" to solve their problems.

In this post, I want to introduce a CI/CD pipeline solution based on GitLab and Rancher v1.6. It may not be the best one, but I still hope it can be inspiring for your needs.

Overview

(Diagram: overview of the CI/CD pipeline built on GitLab CI and Rancher)

As the diagram above shows, for easier understanding we can simply split the whole procedure into two consecutive parts: CI and CD. The CI part mainly relies on GitLab CI, and the CD part lives in the Rancher scope. GitLab CI covers the whole development-build-deploy life cycle with sophisticated workflow settings, while Rancher takes care of deployment-related issues such as environment management, hardware resource management, service composition, monitoring and service scalability.

Evaluation Standards

Before getting into more details, I want to state my criteria for CI/CD pipelines clearly, because my considerations may not fit your circumstances. To avoid unnecessary arguments and wasted time, I suggest you read all these standards before continuing to the following paragraphs.

  • Following Occam's razor, I want to keep the system as simple as possible. I don't want to use too many tools and components.
  • The integration of components in the pipeline should be based on triggers (HTTP requests, messages, etc.) rather than configuration.
  • Reuse existing best practices as much as possible.
  • Offer user-friendly procedures. Most operations and configurations should be doable from a GUI, because a good CI/CD solution should involve all stakeholders of software development, even those who do not know much about programming, shell manipulation or DevOps tools.

Why GitLab CI

I believe hardly anyone uses a version control system other than Git nowadays. Among the three mainstream Git hosting services (GitHub, Bitbucket and GitLab), GitLab is, as of now, the only one with a fully integrated CI function, while the other two can only use webhooks to integrate third-party CI tools, which means more complexity in systems and procedures.


I have to say that GitLab CI is quite complete in functionality and easy to use. Even in the Community Edition you are free to use these CI functions, and the documentation is well written. I guess it is a commercial strategy of GitLab to compete with GitHub. Obviously, it did a good job.

Why Containerization (docker)

Using containers is not a prerequisite for CI/CD. But without containerization (Docker), it costs much more to reach the same level in terms of:

  • deployment & configuration
  • fine-grained service control
  • a thriving ecosystem of tools and methodologies
  • cost of hardware/software resources
  • service scalability

I will not comment too much here on the details of Docker's benefits and features. I believe it has become basic know-how for every developer living in the modern software world.

Of course, you may be a master of Ansible or SaltStack and able to implement a CI/CD pipeline with them effortlessly. I'm not against this kind of solution, and I have to admit that in some scenarios it is more efficient. A CI/CD solution without Docker is out of the scope of this post; in fact, such tools are more suitable for system initialization and deployment of fundamental components than for services.

Why Rancher

Here comes the "vs" part. The question "why X?" equals "why not Y?". I list the competitors of Rancher below (please remind me if I missed any):

AliCloud Container Service is now simply Kubernetes on AliCloud, so there is no need to discuss it in detail.

AWS ECS is perfect for managing your services only if all of them run on AWS EC2. In other words, the most significant flaw of ECS is its lack of cross-datacenter (cross-cloud) capability. Taking myself as an example, most of my services run on AliCloud. In order to coordinate resources on both AWS and AliCloud, I would have to introduce another component (e.g. Terraform) and more complex workflows and procedures, which means a lot of extra work and potential maintenance effort. Furthermore, orchestration features in ECS such as auto-scaling, secret management and automatic load balancing are not as good as in Rancher. All this makes ECS suitable only for small-scale services running purely on EC2 hosts.

Nomad is an elegantly designed system, and I'm happy to use it in prototype systems. Simplicity results in beauty, but also in drawbacks. The most obvious shortcoming of Nomad is the lack of a UI. In my view, a good CI/CD solution should involve all the roles in development, including developers, testers, PMs, etc. Not all of them understand these "magic" tools and operations well, and a friendly UI is sometimes critical for introducing a CI/CD solution in a company. Nomad also has no native service discovery and no real configuration management; you have to make up for it with other tools like Consul or etcd.

Similar drawbacks can be found in Docker Swarm, although Swarm has the advantage of native support in the Docker Engine. If you only want to cluster your containers, it is no doubt your No. 1 option. But if we evaluate it from the viewpoint of the whole CI/CD workflow, the missing pieces of UI, service discovery and configuration management are not negligible. Nevertheless, thanks to its native interoperability with the Docker Engine, it works very well as the infrastructure layer of an orchestration tool.

The Mesos & Marathon based DC/OS is very similar to Rancher. Its UI is better in my eyes, and in terms of functionality it has almost everything Rancher has, except for rich authentication support. But unfortunately, the community edition of DC/OS has very limited functionality compared with the enterprise edition. I take DC/OS as a luxury edition of Mesos, and its community edition is just a demo.

Now we finally come to the most famous one, Kubernetes. I have to admit that Kubernetes has all the features you can imagine. Theoretically, you can use it to do everything in CD: containerization, service scaling, smart clustering, configuration management, logging & monitoring, high availability of services, auto scale-out, etc. Kubernetes is now the best solution for infrastructure management; I would even say there is no second option. But in the end I leave it in the infrastructure layer, not in the service-management layer of CD. The reason has been mentioned above in my comment on Nomad. Compared with Rancher's UI operations, Kubernetes pod configuration is far more DevOps oriented, and I don't want to expose and mix too many infrastructure configs with service deployment configs. Someone may argue that infrastructure & resource config and service deployment config are essentially the same thing. But from my personal point of view, service deployment config maps much more closely to business logic, and I'm not willing to muddle so many things together in one place, especially when the procedure involves dozens of departments and hundreds of users in different roles. Fortunately, Rancher supports a Kubernetes template, which means we can create and manage Kubernetes clusters directly in Rancher. That is really cool. I see Kubernetes as a perfect resource-pool management tool, and Rancher allows a perfect usage of it.

Conclusion:

  • AWS ECS is limited to the Amazon ecosystem and not good at service orchestration.
  • Nomad and Swarm have no UI, service discovery or configuration management.
  • DC/OS is good, but only in its enterprise edition.
  • Kubernetes is perfect for infrastructure management but not easy to use for service configuration in CD.
  • Rancher provides full-featured service configuration for deployment, scaling and clustering, has a nice UI and a powerful API with detailed documentation. It also supports Kubernetes, Swarm and Mesos as the underlying layer.

Please note that all these conclusions and comments follow from my evaluation standards mentioned above. You may come to totally different opinions from a totally different starting point.

Implementation Step by Step

I will not repeat the installation instructions for GitLab, the CI runner and Rancher here; you can easily find them on the official websites. I will only explain the tricky parts and introduce my own CI workflow (which may not suit you).

Rancher behind an Nginx Proxy in Docker

The official documentation gives a good example of an nginx config file for proxy deployment. But if you try to deploy Rancher and nginx together using Docker and have difficulty with the configuration files, you can find a complete example here

.gitlab-ci.yml

My CI config is as below:

variables:
  DOCKER_REGISTRY_URL: "your_private_docker_registry_url"
  # rancher server API endpoint
  # must include the http(s) scheme
  RANCHER_ENDPOINT_URL: "your_rancher_api_endpoint_url"

stages:
  - build
  - test
  - staging
  - deploy

before_script:
  - 'echo "PROJECT NAME : $CI_PROJECT_NAME"'
  - 'echo "PROJECT ID : $CI_PROJECT_ID"'
  - 'echo "PROJECT URL : $CI_PROJECT_URL"'
  - 'echo "ENVIRONMENT URL : $CI_ENVIRONMENT_URL"'
  - 'echo "DOCKER REGISTRY URL : $DOCKER_REGISTRY_URL"'
  - 'export PATH=$PATH:/usr/bin'

# after_script:

build_image:
  stage: build
  only:
    - master
    - develop
    - staging
  when: manual
  allow_failure: false
  script:
    - 'echo "Job $CI_JOB_NAME triggered by $GITLAB_USER_NAME ($GITLAB_USER_ID)"'
    - 'echo "Build on $CI_COMMIT_REF_NAME"'
    - 'echo "HEAD commit SHA $CI_COMMIT_SHA"'
    # docker repo name must be lowercase
    - 'PROJECT_NAME_LOWERCASE=$(tr "[:upper:]" "[:lower:]" <<< $CI_PROJECT_NAME)'
    - 'IMAGE_REPO=$DOCKER_REGISTRY_URL/$PROJECT_NAME_LOWERCASE/$CI_COMMIT_REF_NAME'
    - 'IMAGE_TAG=$IMAGE_REPO:$CI_COMMIT_SHA'
    - 'IMAGE_TAG_LATEST=$IMAGE_REPO:latest'
    - 'docker build -t $IMAGE_TAG -t $IMAGE_TAG_LATEST .'
    - 'OLD_IMAGE_ID=$(docker images --filter="before=$IMAGE_TAG" $IMAGE_REPO -q)'
    - '[[ -z $OLD_IMAGE_ID ]] || docker rmi -f $OLD_IMAGE_ID'
    - 'docker push $IMAGE_TAG'
    - 'docker push $IMAGE_TAG_LATEST'

deploy_test:
  stage: test
  only:
    - develop
  when: manual
  environment:
    name: test
  variables:
    CI_RANCHER_ACCESS_KEY: $CI_RANCHER_ACCESS_KEY_TEST
    CI_RANCHER_SECRET_KEY: $CI_RANCHER_SECRET_KEY_TEST
    CI_RANCHER_STACK: $CI_RANCHER_STACK
    CI_RANCHER_SERVICE: $CI_RANCHER_SERVICE
    CI_RANCHER_ENV: $CI_RANCHER_ENV_TEST
  script:
    - 'echo "Deploy for test"'
    - 'ceres rancher_deploy --rancher-url=$RANCHER_ENDPOINT_URL --rancher-key=$CI_RANCHER_ACCESS_KEY --rancher-secret=$CI_RANCHER_SECRET_KEY --service=$CI_RANCHER_SERVICE --stack=$CI_RANCHER_STACK --rancher-env=$CI_RANCHER_ENV'

deploy_validate:
  stage: staging
  only:
    - staging
  when: manual
  environment:
    name: staging
  variables:
    CI_RANCHER_ACCESS_KEY: $CI_RANCHER_ACCESS_KEY_TEST
    CI_RANCHER_SECRET_KEY: $CI_RANCHER_SECRET_KEY_TEST
    CI_RANCHER_STACK: $CI_RANCHER_STACK
    CI_RANCHER_SERVICE: $CI_RANCHER_SERVICE
    CI_RANCHER_ENV: $CI_RANCHER_ENV_STAGING
  script:
    - 'echo "Deploy for validation"'
    - 'ceres rancher_deploy --rancher-url=$RANCHER_ENDPOINT_URL --rancher-key=$CI_RANCHER_ACCESS_KEY --rancher-secret=$CI_RANCHER_SECRET_KEY --service=$CI_RANCHER_SERVICE --stack=$CI_RANCHER_STACK --rancher-env=$CI_RANCHER_ENV'

deploy_production:
  stage: deploy
  only:
    - master
  when: manual
  environment:
    name: production
  variables:
    CI_RANCHER_ACCESS_KEY: $CI_RANCHER_ACCESS_KEY_TEST
    CI_RANCHER_SECRET_KEY: $CI_RANCHER_SECRET_KEY_TEST
    CI_RANCHER_STACK: $CI_RANCHER_STACK
    CI_RANCHER_SERVICE: $CI_RANCHER_SERVICE
    CI_RANCHER_ENV: $CI_RANCHER_ENV_PROD
  script:
    - 'echo "Deploy for production"'
    - 'ceres rancher_deploy --rancher-url=$RANCHER_ENDPOINT_URL --rancher-key=$CI_RANCHER_ACCESS_KEY --rancher-secret=$CI_RANCHER_SECRET_KEY --service=$CI_RANCHER_SERVICE --stack=$CI_RANCHER_STACK --rancher-env=$CI_RANCHER_ENV'

You can find more info about GitLab CI configuration here if needed. I believe you will understand the workflow defined in the YAML file once you are familiar with GitLab CI configuration.

Deploy using Rancher

In the deployment jobs defined in the GitLab CI config above, there is only one command, starting with "ceres rancher_deploy". What does it mean? In fact, "ceres" is a command (a Python package) installed in the GitLab CI runner environment, and "rancher_deploy" is a subcommand of "ceres". The full help text of the "rancher_deploy" command is:

Usage: ceres rancher_deploy [OPTIONS]

  Deploy using rancher API

Options:
  --rancher-url TEXT              rancher server API endpoint URL  [required]
  --rancher-key TEXT              rancher account or environment API access
                                  key  [required]
  --rancher-secret TEXT           rancher account or environment API secret
                                  corresponding to the access key  [required]
  --rancher-env TEXT              used to specify environment if account key
                                  is provided
  --stack TEXT                    stack name defined in rancher  [required]
  --service TEXT                  service name defined in rancher  [required]
  --batch-size INTEGER            number of containers to upgrade at once
  --batch-interval INTEGER        interval (in second) between upgrade batches
  --sidekicks / --no-sidekicks    upgrade sidekicks services at the same time
  --start-before-stopping / --no-start-before-stopping
                                  start new containers before stopping the old
                                  ones
  --help                          Show this message and exit.

The source code of the rancher_deploy command is as below:

@cli.command()
@click.option('--rancher-url', required=True, help='rancher server API endpoint URL')
@click.option('--rancher-key', required=True, help='rancher account or environment API access key')
@click.option('--rancher-secret', required=True, help='rancher account or environment API secret corresponding to the access key')
@click.option('--rancher-env', default=None, help='used to specify environment if account key is provided')
@click.option('--stack', required=True, help='stack name defined in rancher')
@click.option('--service', required=True, help='service name defined in rancher')
@click.option('--batch-size', default=1, help='number of containers to upgrade at once')
@click.option('--batch-interval', default=2, help='interval (in second) between upgrade batches')
@click.option('--sidekicks/--no-sidekicks', default=False, help='upgrade sidekicks services at the same time')
@click.option('--start-before-stopping/--no-start-before-stopping', default=False,
              help='start new containers before stopping the old ones')
def rancher_deploy(rancher_url, rancher_key, rancher_secret, rancher_env, stack, service,
                   batch_size, batch_interval, sidekicks, start_before_stopping):
    """Deploy using rancher API"""
    rancher_cli = RancherClient(rancher_url, rancher_key, rancher_secret)
    env_id = rancher_cli.environment_id(rancher_env)
    if not env_id:
        # click.Abort is an exception: it must be raised to actually stop the command
        click.secho('Environment {} not found in rancher'.format(rancher_env), fg='red')
        raise click.Abort()
    service_info = rancher_cli.service_info(env_id, stack, service)
    if not service_info:
        click.secho('Service {} not found in rancher'.format(service), fg='red')
        raise click.Abort()
    service_id = service_info['id']
    click.secho('Check and finish service upgrade')
    service_info = rancher_cli.service_finish_upgrade(env_id, service_id)
    click.secho('Service info:')
    click.secho(json.dumps(service_info))

    # do upgrade
    rancher_cli.service_upgrade(env_id, service_id, batch_size, batch_interval, sidekicks, start_before_stopping)
    click.secho('Waiting for upgrade finish')
    service_info = rancher_cli.service_finish_upgrade(env_id, service_id)

    click.secho('Service {} deploy complete on {}'.format(service_info['name'], rancher_url))
    return service_info
The RancherClient class used above lives in a separate module:

# -*- coding: utf8 -*-

import json
from typing import Dict
from time import sleep
from copy import deepcopy

import requests


class RancherClient(object):
    def __init__(self, endpoint_url: str, key: str, secret: str):
        """
        Args:
            endpoint_url (str): rancher server API endpoint URL
            key (str): rancher account or environment API access key
            secret (str): rancher account or environment API secret corresponding to the access key
        """
        self.endpoint_url = endpoint_url
        self.s = requests.Session()
        self.s.auth = (key, secret)
        # timeout in second for retry
        self.timeout = 60

    def environment_id(self, name: str=None) -> str:
        """
        Get rancher environment ID. If using an account key, return the environment ID specified by `name`.

        Args:
            name (str): name for the environment requested (only useful for account key)

        Returns:
            environment ID in string
        """
        if not name:
            r = self.s.get('{}/projects'.format(self.endpoint_url), params={'limit': 1000})
        else:
            r = self.s.get('{}/projects'.format(self.endpoint_url), params={'limit': 1000, 'name': name})
        r.raise_for_status()
        data = r.json()['data']
        if data:
            return data[0]['id']
        return None

    def service_info(self, environment_id: str, stack_name: str, service_name: str) -> Dict:
        """
        Get rancher service info by given environment id and service name.

        Args:
            environment_id (str): defined environment id in rancher
            stack_name (str): defined stack name in rancher
            service_name (str): defined service name in rancher

        Returns:
            service info in json
        """
        if not environment_id:
            raise Exception('Empty rancher environment ID')
        r = self.s.get('{}/projects/{}/stacks'.format(self.endpoint_url, environment_id),
                       params={'limit': 1000, 'name': stack_name})
        r.raise_for_status()
        data = r.json()['data']

        if not data:
            # stack not found
            raise Exception('Stack {} not found'.format(stack_name))

        stack_info = deepcopy(data[0])

        r = self.s.get('{}/projects/{}/services'.format(self.endpoint_url, environment_id),
                       params={'name': service_name})
        r.raise_for_status()
        data = r.json()['data']
        if not data:
            # service not found
            return None
        for service_info in data:
            if service_info['stackId'] == stack_info['id']:
                return service_info
        return None


    def service_finish_upgrade(self, environment_id: str, service_id: str) -> Dict:
        """
        Finish service upgrade when service is in `upgraded` state.

        Args:
            environment_id (str): defined environment id in rancher
            service_id (str): service id defined in rancher

        Returns:
            service info in json
        """
        r = self.s.get('{}/projects/{}/services/{}'.format(self.endpoint_url, environment_id, service_id))
        r.raise_for_status()
        data = r.json()
        if data.get('type') == 'error':
            raise Exception(json.dumps(data))
        if data['state'] == 'active':
            return data

        if data['state'] == 'upgrading':
            retry = 0
            while data['state'] != 'upgraded':
                sleep(2)
                retry += 2
                if retry > self.timeout:
                    raise Exception('Timeout of rancher finish upgrade service {}'.format(service_id))
                r = self.s.get('{}/projects/{}/services/{}'.format(self.endpoint_url, environment_id, service_id))
                r.raise_for_status()
                data = r.json()

        if data['state'] != 'upgraded':
            raise Exception('Unable to finish upgrade service in state of {}'.format(data['state']))
        r = self.s.post('{}/projects/{}/services/{}/'.format(self.endpoint_url, environment_id, service_id),
                        params={'action': 'finishupgrade'})
        r.raise_for_status()

        # wait till service finish upgrading
        retry = 0
        while data['state'] != 'active':
            sleep(2)
            retry += 2
            if retry > self.timeout:
                raise Exception('Timeout of rancher finish upgrade service {}'.format(service_id))
            r = self.s.get('{}/projects/{}/services/{}'.format(self.endpoint_url, environment_id, service_id))
            r.raise_for_status()
            data = r.json()

        return data

    def service_upgrade(self, environment_id: str, service_id: str, batch_size: int=1,
                        batch_interval: int=2, sidekicks: bool=False, start_before_stopping: bool=False) -> Dict:
        """
        Upgrade service

        Args:
            environment_id (str): defined environment id in rancher
            service_id (str): service id defined in rancher
            batch_size (int): number of containers to upgrade at once
            batch_interval (int): interval (in second) between upgrade batches
            sidekicks (bool): upgrade sidekicks services at the same time
            start_before_stopping (bool): start new containers before stopping the old ones

        Returns:
            service info in json
        """
        r = self.s.get('{}/projects/{}/services/{}'.format(self.endpoint_url, environment_id, service_id))
        r.raise_for_status()
        data = r.json()
        if data.get('type') == 'error':
            raise Exception(json.dumps(data))
        if data['state'] != 'active':
            raise Exception('Service {} in state of {}, cannot upgrade'.format(service_id, data['state']))

        upgrade_input = {'inServiceStrategy': {
            'batchSize': batch_size,
            'intervalMillis': batch_interval * 1000,
            'startFirst': start_before_stopping,
            'launchConfig': data['launchConfig'],
            'secondaryLaunchConfigs': [],
        }}
        if sidekicks:
            upgrade_input['inServiceStrategy']['secondaryLaunchConfigs'] = data['secondaryLaunchConfigs']

        r = self.s.post('{}/projects/{}/services/{}/'.format(self.endpoint_url, environment_id, service_id),
                        params={'action': 'upgrade'}, json=upgrade_input)
        r.raise_for_status()

        return r.json()

Note that I use version 2.0 beta of the Rancher API here. If you want to use Rancher API v1.0, some modifications must be made.

The benefit of encapsulating the Rancher trigger logic in a Python package (command) is to hide the complexity of the scripts from the GitLab CI config. More importantly, it decouples the CI workflow from the CD execution details. Otherwise, every change in the Rancher deployment logic would require a change to the .gitlab-ci.yml file of every project following this CI/CD workflow, which would be a disaster.

An elegant solution is to build the Python package into the GitLab runner Docker image and register the runner in Docker mode in GitLab.

Conclusion

This post introduced a CI/CD pipeline solution based on GitLab CI and Rancher, explained why GitLab and Rancher were chosen to implement the whole workflow, and demonstrated the details of the design and implementation.

Tracing System Overview

Tracing systems have nowadays become a must-have piece of infrastructure, alongside the long-existing logging systems. This is not surprising, since micro-service architectures have been widely adopted during the last decade (even though a lot of developers and so-called architects do not know exactly how to implement them). This article aims to give a clear outline of tracing systems, in terms of features, architectures, pros and cons, by analysing the most famous ones: Dapper, Zipkin, Jaeger and OpenTracing.

Google Dapper

When we talk about tracing systems in the context of micro-service architecture, it is impossible to ignore the famous Google Dapper paper. It is so important not only because Dapper was the first practical large-scale distributed tracing system running in a complex production environment, but also because of the key features, definitions and best practices it unveiled.

Terms

I guess you have heard of or seen "span" when trying to use a tracing system client, but you may have been confused about why there is a "span". It comes from the frequently quoted diagram below:

(Diagram: a Dapper trace tree composed of spans)

It is an elegant data structure for tracing data, which is why all the followers keep using the design and the terms untouched.

Another important term introduced is "annotation", which provides a mechanism for attaching application-specific information. Chapter 3.3 of the paper notes that "Programmers tend to use application-specific annotations either as a kind of distributed debug log file or to classify traces by some application-specific features". Other tracing systems have similar designs for the same purpose. In fact, all the experiences and use cases at Google mentioned in chapter 6 of the paper rely more or less on information attached via annotations. On the other hand, the authors did not forget to stress the "small overhead" principle; the reminder is necessary, because some people still use the tracing system to do the job of other logging and statistics tools.
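To make the structure concrete, here is a rough Python sketch of what a span record carries; the field names are my own illustration, not Dapper's exact schema.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]   # None for the root span of the trace tree
    operation_name: str
    start_us: int                   # start timestamp in microseconds
    duration_us: int
    annotations: Dict[str, str] = field(default_factory=dict)  # e.g. {"order.id": "A-12345"}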

Tracing System Implementation Principles

Besides the definition of the tracing data structures, the other contribution of Dapper is to explain how to implement a tracing system (even if not in full detail) and why.

Let's look at another diagram quoted a thousand times, shown below:

(Diagram: Dapper's trace collection pipeline, from local daemons to the central collector and its storage)

The pipeline can be summed up in several key points:

  • Trace data is sent to a local daemon on each host. Services do not connect directly to the collector, which avoids latency or blocking errors.
  • All daemons send data to a central collector.
  • The data sent by the daemons is sampled in order to keep the collected volume as small as possible, while the sampled fraction does not degrade tracing quality.
  • The central collector saves tracing data in a search-oriented NoSQL database with appropriate indexes.

All these principles are inherited wholesale by the followers.

The Dapper paper gives other important notes which are not only about Google internal usage but also universal for all cases.

For example, ubiquitous deployment: at Google, the Dapper client is embedded in the internal RPC framework, so the trace context is injected automatically, and the Dapper daemon is deployed automatically in service containers via the common internal base image for services. Though we cannot use all these Google internal infrastructures (in fact we do not need to), it is still a very helpful best practice.

There are other significant suggestions, such as secondary sampling of aggressive tracing data in the collector, and using a column-oriented database like HBase (an open source implementation of the BigTable design) to store tracing data.

Zipkin

Zipkin is the first well-known open source implementation of a Dapper-like system. The relationship between Zipkin and Dapper is just like that between Hadoop and BigTable.

Zipkin focuses on the collector part, with BigTable-like storage (Zipkin uses Cassandra or Elasticsearch), a DAPI-like query API and a GUI. I believe most users choose it because of its out-of-the-box web UI, which was really fantastic at the time (maybe not as good as Dapper's, but impressive enough for those not working at Google), even though Zipkin misses some features such as trace data sampling and control of the reporting daemon.

It is a straightforward implementation using mature open source Java components. I think many similar implementations were done silently; Zipkin was just the first, and might be one of the best. Some talented and ambitious teams were users of Zipkin. Among them was Uber's DevOps team, which later built its own tracing system, and I will say a few words about it in the following sections.

Zipkin has been renamed OpenZipkin, aiming to attract more contributions from the community. But compared with OpenTracing and Jaeger, which are already in the Cloud Native Computing Foundation, Zipkin no longer looks attractive enough to be the first choice for new projects implementing tracing.

OpenTracing

OpenTracing is a project for a universal tracing specification rather than a production-ready implementation. In the GitHub repositories of OpenTracing, you can find:

  • Semantic Specification: definition of the data models (trace, span, context) and APIs. It gives implementation details of the tracing data models which are missing from the original Google Dapper paper.
  • Semantic Conventions: standard tags for spans, much like the key/value usage of annotations mentioned in the Google Dapper paper.
  • Client libraries for different languages: in fact, OpenTracing does not provide production-ready tracing libraries. It only defines APIs following the OpenTracing specification; specific instrumentations need to inherit from the OpenTracing client libraries as base classes and override the methods (just like what the Uber Jaeger client does).

For a clearer overview of the data model, you can look at the following diagram:

(Diagram: OpenTracing data model example)

Remember that OpenTracing is just a specification. You use it as a guideline, not as an out-of-the-box tracing product. It is a good starting point for developing your own tracing system or for understanding how a tracing system works. Note that OpenTracing does not cover any specification or implementation details for the tracing daemon and the centralized collector.

The specific tracing clients used in frameworks, libraries and projects are called instrumentations. Given the different implementations of context propagation, there is no way to have one generalized instrumentation for all frameworks. OpenTracing gives a clear guide to developing instrumentation based on it:

The work of instrumentation libraries generally consists of three steps (a minimal Python sketch follows the list):

  1. When a service receives a new request (over HTTP or some other protocol), it uses OpenTracing's inject/extract API to continue an active trace, creating a Span object in the process. If the request does not contain an active trace, the service starts a new trace and a new root Span.
  2. The service needs to store the current Span in some request-local storage, from which it can be retrieved when a child Span must be created, e.g. when the service makes an RPC to another service.
  3. When making outbound calls to another service, the current Span must be retrieved from request-local storage, a child span must be created (e.g., by using the start_child_span() helper), and that child span must be embedded into the outbound request (e.g., using HTTP headers) via OpenTracing's inject/extract API.
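A minimal sketch of these three steps using the opentracing Python package (function and variable names are mine, and error handling is omitted):

import opentracing
from opentracing.propagation import Format

tracer = opentracing.global_tracer()

def handle_request(headers, do_work):
    # Step 1: continue the caller's trace (or start a new root span) via extract().
    parent_ctx = tracer.extract(Format.HTTP_HEADERS, headers)
    # Steps 1 & 2: start_active_span() creates the span and keeps it in
    # request-local storage through the tracer's scope manager.
    with tracer.start_active_span('handle_request', child_of=parent_ctx):
        do_work()

def call_downstream(outbound_headers):
    # Step 3: create a child of the active span and inject its context
    # into the outbound request headers.
    with tracer.start_active_span('call_downstream') as scope:
        tracer.inject(scope.span.context, Format.HTTP_HEADERS, outbound_headers)
        # ... perform the outbound HTTP call with outbound_headers ...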

For direct usage of tracing clients (instrumentations), you can find the mainstream ones at OpenTracing API Contributions. If you are using Python, you may find some useful instrumentations in Uber's OpenTracing Python Instrumentation.

Review of Kubernetes in Action

This book contains all the knowledge you need about Kubernetes as a beginner. I have to say I really like the illustrations; they are very helpful for understanding. Sometimes you can grasp a Kubernetes concept very easily just by glancing at the illustration, without reading a single word.

Experienced Kubernetes users can jump straight to chapter 17 (best practices for developing apps) and chapter 18 (extending Kubernetes). These two chapters touch on real-world problems and give clear guidelines for solving them. Chapter 17 in particular can serve as a very good entry point and overview for understanding Kubernetes from the user's perspective.

From my point of view, there are still a lot of deeper topics that could be included in this book, e.g. Rancher as a practical Service Catalog, CI/CD usage with Kubernetes, different distributions of k8s, etc. Maybe these contents will be in the next edition of the book.

Build a Complete eCommerce System: Prologue

eCommerce has been a classic Internet application since the arrival of the "web age". Today we use the services of Amazon, Alibaba and eBay every day, from PCs and smartphones. All of them are eCommerce systems, even though they follow different business models. We are very familiar with these services, their processes, terms and the logic behind them. From a technical point of view, it is easy to understand the fundamental components and software used to build an eCommerce website. One can easily build one within 10 minutes using some out-of-the-box open source solution such as Magento.

So why am I still trying to re-explain the architecture of such a system, with all its functionalities and implementation details?

Why?

Unfortunately, most of us (software developers, product managers and IT consultants) do not know eCommerce systems & technologies as well as we imagine, and we frequently make mistakes, from system design down to implementation details.

The problems can be summed up as:

  • Incorrect or incomplete domain modelling
    I've seen a lot of really bad designs and implementations. There are so many developers and PMs who don't know what SPU & SKU are and what they are for (see the sketch after this list). Many eCommerce systems lack appropriate categories and attributes, which causes endless refactoring and failures in daily operation. Many developers start to build an eCommerce system without a clear picture of the order lifecycle and its interactions with the stocking, logistics and promotion functions. I do not intend to blame anyone; in fact, I have made all of the above mistakes in my career. And I found that all these mistakes could have been avoided with a correct domain model and an appropriate modularization of the system. It is a pity that so many newcomers keep repeating similar mistakes and wasting time and resources repairing them (or, even worse, producing more mistakes). I hope I can give a general eCommerce domain model that helps avoid the fundamental mistakes from the beginning.
  • Business logic mismatch
    With the arrival of new business models and innovations, the system has to adapt itself to new patterns and third-party systems such as payment gateways, CRM SaaS, new processes and so on, which inevitably increases the complexity of the system. At the same time there are internal evolutions such as warehouse upgrades, development of logistics networks, etc. All of these changes, from the internal or the external environment, affect the system or, even worse, break its original architecture. That is why we need to design the system carefully to avoid potential problems.
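For those wondering about the SPU/SKU distinction mentioned above, here is a deliberately simplified sketch (the field names are my own): the SPU is the abstract product, and each SKU is a concrete, sellable variant of it identified by its attribute values.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SPU:
    """Standard Product Unit: the abstract product, e.g. "iPhone 11"."""
    spu_id: str
    title: str
    category_id: str
    sale_attributes: List[str]          # e.g. ["color", "storage"]

@dataclass
class SKU:
    """Stock Keeping Unit: one sellable variant, e.g. "iPhone 11, 128GB, black"."""
    sku_id: str
    spu_id: str
    attribute_values: Dict[str, str]    # e.g. {"color": "black", "storage": "128GB"}
    price_cents: int
    stock: int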

Besides the existing misunderstandings of eCommerce systems, the other reason to spotlight them is that they cover almost all parts of general information systems. In fact, you can find all the classic enterprise (2B) software modules in an eCommerce system, such as CRM, CMS, WMS, MRP, etc. Moreover, more and more new technologies are involved, e.g. search engines, NLP, machine learning, etc. Since eCommerce is now the largest and most comprehensive fundamental application domain in the IT world, it is worth every IT professional's while to understand it. No matter which specific domain you are working in, this knowledge will be helpful.

Contents

  • Commodity Module
    • Category & Attribute
    • SKU & SPU
    • Inventory
  • User Module
    • Authentication
    • Back Office Role & Authorities
    • Cart
    • User Tagging
  • Order Module
    • FSM of Order
    • Order Splitting
    • Interaction with Other Modules
  • Payment Module
  • Logistics Module
  • CMS Module
  • Warehouse Management Module
  • Search & Ranking Service

Real-World Practice

This series of articles will focus not only on the domain model and logical architecture of the eCommerce system, but will also give concrete code demonstrations. The code is meant to aid understanding of the architecture and to give some inspiration for implementations. You may well end up with different implementations; it depends entirely on the demands of your real business scenarios. And don't forget that a good system is never the most complex one but the most suitable one. If you are just beginning to build a new system, I recommend starting with the mandatory modules and keeping them simple. Once you understand the business demands and your domain models better, you can easily scale your system out and add more functions to it. For example, you might detach the inventory function into an independent service when the number of SKUs drastically increases and your warehouse management also becomes complex.

Review of Deep Learning with PyTorch


Published by the official PyTorch team and free to download on the official PyTorch site, this book is recommended for green hands trying to step into the machine learning world. Compared with the official tutorial and documentation, which are already very good, this book is relatively more systematic. I take it as a pre-tutorial with all the necessary basic ML knowledge and operation guides for new PyTorch users, before they dive into detailed examples.

Chapter 1 is a very general and shallow introduction and can easily be skipped. Chapter 2 focuses on basic tensor operations; the examples are very easy to understand. Chapter 3 is about data import; the book gives some simple examples such as CSV, text and images, but not in detail, so it is better to look at the related chapters of the PyTorch documentation instead.

Chapter 4 introduces an easy linear regression example using PyTorch autograd and demonstrates how to do basic backpropagation. Though the example is very simple, it still highlights the most important concepts and operations around a tensor's autograd option. The book does not explain in depth how PyTorch performs backpropagation and builds the computation graph; it focuses instead on the necessary basic operations and some important tips, e.g. zeroing the gradients after backward, switching off unnecessary autograd, etc.
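As a reminder of those two tips, here is a minimal sketch (not the book's exact example) of a hand-written training loop showing where the gradients are zeroed and where autograd is switched off:

import torch

x = torch.randn(100)
y = 3.0 * x + 0.5                                   # toy linear data
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2

for epoch in range(100):
    if params.grad is not None:
        params.grad.zero_()                         # zero gradients left by the previous backward()
    y_pred = params[0] * x + params[1]
    loss = ((y_pred - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():                           # switch off autograd for the parameter update
        params -= learning_rate * params.grad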

Chapter 5 introduces the torch.nn module and replaces the hand-made model from chapter 4 with a simple network using torch.nn.Sequential. There is nothing special about the network definition. Some basic ML concepts are mentioned, including why a neural network can approximate any function (not very clearly...) and why activation functions are needed.
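A minimal sketch of that replacement (layer sizes and data here are illustrative, not copied from the book):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1, 16),
    nn.Tanh(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.randn(100, 1)
y = 3.0 * x + 0.5

for epoch in range(200):
    optimizer.zero_grad()                # the same "zero the gradients" tip, now via the optimizer
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()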

To summarize, this book is a very good starting point for those who are new to PyTorch (even better than the official tutorial). The content is well sequenced and easy to understand; also thanks to the simplicity of PyTorch itself, a reader without any ML background can easily pick up all the necessary basics and immediately start simple tasks. Highly recommended as a 101 class.

Python Package with Dependencies

As Python developers, we usually build Python packages to share our work (devpi for internal & private scope, PyPI for public open source work). Normally, basic usage of setup.py is enough. But there are still some specific scenarios which need a more sophisticated packaging process. For example, in a particular CI/CD workflow you may not want to spend time downloading package dependencies from PyPI (or devpi) every time the Docker image is built. So how do we do this?

Create a wheel house

The first thing to do is to turn all dependencies into local packages (wheels). pip supports this:

pip wheel -r requirements.txt --wheel-dir <path>

The command downloads all dependencies defined in requirements.txt from PyPI as usual and builds them into wheel packages in the specified directory.

To simplify the operation, you can wrap it in a setup.py command. The following snippet demonstrates a packaging procedure that also builds the dependency wheels.

import os
import subprocess
import sys

from setuptools import Command

# project root (the directory containing setup.py)
pwd_path = os.path.dirname(os.path.abspath(__file__))


class PackageCommand(Command):
    """Support setup.py package"""

    description = 'Build package.'
    user_options = [
        ('with-deps', None, 'package with all wheel packages of dependencies'),
    ]

    def initialize_options(self):
        """Set default values for options"""
        self.with_deps=False

    def finalize_options(self):
        """Post-process options"""
        if self.with_deps:
            self.with_deps=True

    def run(self):
        """Run command"""
        clear_files = [
            os.path.join(pwd_path, 'build'),
            os.path.join(pwd_path, 'dist'),
            os.path.join(pwd_path, '*/*.egg-info'),
            os.path.join(pwd_path, 'Youtiao.egg-info'),
        ]
        for cf in clear_files:
            print('rm {}'.format(cf))
            subprocess.run(['rm', '-rf', cf])

        # make sure that wheel is installed
        subprocess.run(['python', 'setup.py', 'bdist', 'bdist_wheel', '--universal'])

        if self.with_deps:
            rqm_path = os.path.join(pwd_path, 'requirements.txt')
            wheels_path = os.path.join(pwd_path, 'wheels')
            subprocess.run(['rm', '-rf', wheels_path])
            subprocess.run(['mkdir', '-p', wheels_path])
            subprocess.run('pip wheel --wheel-dir={} -r {}'.format(wheels_path, rqm_path), shell=True)

        sys.exit(0)

Register the command in the cmdclass argument of your setup() call.
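For example, a minimal sketch (the package name and version are placeholders):

from setuptools import setup, find_packages

setup(
    name='your_package',                      # placeholder
    version='0.1.0',                          # placeholder
    packages=find_packages(),
    cmdclass={'package': PackageCommand},     # enables: python setup.py package --with-deps
)

Running python setup.py package --with-deps then builds the project wheel under dist/ and the dependency wheels under wheels/.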

Build the Docker image without downloading dependencies

Normally we run a simple pip install -r requirements.txt in the Docker build script. It needs some modification if you want to avoid downloads from PyPI. A simple Dockerfile example:

FROM python:3.6

COPY . /app
WORKDIR /app

# install wheel 
# make sure that your project wheel under /app/dist
# and wheels of dependencies under /app/wheels
RUN pip install /app/dist/<project_wheel> --no-index --find-links /app/wheels

ENTRYPOINT ["python"]
CMD ["start_point_script.py"]

Voila

Why use Github issue page as blog?

Since the first release of GitHub Pages, it has become the most popular way to build and host a personal blog in the geek world. It seems you would look like a degenerate caveman if you didn't have one. And yes, I was also one of the hipsters. I spent tens of hours configuring my blogs with Hexo and Jekyll and enjoyed the fancy plugins and widgets. And after several weeks, I quit with zero meaningful articles left on my blog.

So sad.

When I looked back and tried to restart my tech blog writing (meaningless self-talk, in fact), I asked myself what features I really need.

They are not difficult to list:

  • an online editor supporting Markdown styling
  • hosting on a popular hacker community (no substitute for GitHub...)
  • free service
  • easy to update
  • easy to comment and reply

No need for a fancy UI or tags (they are more useful for front-end developers, not for me). After browsing GitHub's existing functions, the issue page seems to fit my basic requirements 100%: super-efficient publishing & updating on all sizes of screens and network conditions, and no more CSS & JS download overhead when you open the editing page (imagine you're on the metro and your smartphone can only get a 2G EDGE connection). You even get bonus features, e.g. labels, emojis, online draft saving, comment notifications, etc.

A more than perfect solution, I have to say.
