At InterCloud, monitoring our infrastructure is vital to guarantee reliability at all times. Several months ago, we ran into limitations in our existing monitoring system, which was built on SmokePing and Munin. After further investigation, we traced the problem to a flaw in the underlying round-robin database (RRD).
To solve these issues, we decided to build a distributed Operation Network & System Support (ONSS) stack with Open Source components:
- Telegraf: a data collector written in Go for collecting, processing, aggregating and writing metrics. The tool is well documented and straightforward, which allowed us to write custom input plugins for vSphere & Icinga2.
- InfluxDB: a scalable time series database for metrics, events and real-time analytics.
- Grafana: a data visualization and exploration tool. It lets you create graphs and dashboards based on data from various data sources (InfluxDB, Prometheus, Elasticsearch, CloudWatch ...). Chronograf doesn't cut the mustard yet when compared to Grafana. That is hardly surprising given that it is still a relatively young project: it will need time to mature, but it is on track to meet our goals.
- Icinga2: our existing open source tool that we use to monitor the health of networked hosts and services (BGP sessions, interface state, etc.). It could be replaced with Telegraf, which supports plugins for SNMP & NetConf; however, for the first version of the project, we decided to postpone this migration to a later stage.
- Kapacitor: a data processing framework for anomaly detection and alerting (it supports multiple event handlers: Slack, OpsGenie, HipChat, PagerDuty, etc.).
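To make the collection pipeline concrete, here is a minimal sketch of a Telegraf configuration polling a network device over SNMP and writing to InfluxDB. The agent address, community string and database name are illustrative placeholders, not our actual values:

```toml
# Hypothetical example: agent address, community and database are placeholders.
[agent]
  interval = "10s"

[[inputs.snmp]]
  agents = ["udp://192.0.2.1:161"]   # example device address
  version = 2
  community = "public"

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "RFC1213-MIB::sysUpTime.0"

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "telegraf"
```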
To automate the build, ship & run of our containers, every push to the code repository triggers a Jenkins build, which is scheduled on one of the available Jenkins slaves. The slave builds a Docker image of the service, then pushes the artifact to our Docker Registry. Finally, a downstream job is triggered to deploy the release to the related ONSS environment. The whole process is illustrated in the image below:
We then use a Jenkins Multibranch Pipeline to build a different image for each environment (sandbox, staging, production). Every branch has its own Dockerfile, custom configuration files for the environment, as well as a Jenkinsfile which defines the CI/CD pipeline logic for the project, captured in various stages:
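A Jenkinsfile for such a pipeline might look like the following sketch. The image name, registry host and downstream job name are hypothetical placeholders, not our actual project values:

```groovy
// Illustrative declarative pipeline; registry.example.com, the image
// name and the deploy job are placeholders.
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'docker build -t registry.example.com/onss/telegraf:${BRANCH_NAME}-${BUILD_NUMBER} .'
            }
        }
        stage('Push') {
            steps {
                sh 'docker push registry.example.com/onss/telegraf:${BRANCH_NAME}-${BUILD_NUMBER}'
            }
        }
        stage('Deploy') {
            steps {
                // Trigger the downstream job for this branch's environment.
                build job: "deploy-onss-${BRANCH_NAME}"
            }
        }
    }
}
```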
Defining the version and build number of each ONSS component is at the core of continuous integration and deployment. That's why we follow the Semantic Versioning guidelines to define the different versions:
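As a minimal illustration of how Semantic Versioning works in practice, the sketch below parses a MAJOR.MINOR.PATCH string and bumps one part, resetting the lower ones. The function name and interface are our own for this example, not part of any release tooling:

```python
import re

# MAJOR.MINOR.PATCH, e.g. "1.4.2"
SEMVER_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")

def bump(version: str, part: str) -> str:
    """Return the next semantic version, resetting the lower parts.

    bump("1.4.2", "minor") yields "1.5.0": a backwards-compatible
    feature resets the patch counter.
    """
    m = SEMVER_RE.match(version)
    if not m:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version}")
    major, minor, patch = (int(g) for g in m.groups())
    if part == "major":      # incompatible API change
        return f"{major + 1}.0.0"
    if part == "minor":      # backwards-compatible feature
        return f"{major}.{minor + 1}.0"
    if part == "patch":      # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part}")
```

The resulting version string is then typically used as the Docker image tag, so every artifact in the registry maps back to an exact release.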
To create multiple ONSS environments (staging & production), we used Terraform alongside Packer to provision and manage infrastructure on vSphere & AWS. Once the servers are created, we use Ansible with built-in roles to turn them into a Swarm cluster. Finally, for service discovery, we use Traefik as our reverse proxy:
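As a sketch of what the Terraform side can look like, the fragment below declares a small pool of vSphere virtual machines for the Swarm nodes. All names, sizes and data sources are illustrative assumptions, not our actual configuration:

```hcl
# Illustrative only: names, sizes and referenced data sources are placeholders.
resource "vsphere_virtual_machine" "swarm_node" {
  count            = 3
  name             = "onss-swarm-${count.index}"
  resource_pool_id = data.vsphere_resource_pool.pool.id
  datastore_id     = data.vsphere_datastore.ds.id
  num_cpus         = 2
  memory           = 4096
  guest_id         = "other3xLinux64Guest"

  network_interface {
    network_id = data.vsphere_network.net.id
  }

  disk {
    label = "disk0"
    size  = 40
  }
}
```

Ansible then targets the resulting hosts to install Docker and join them into the Swarm cluster.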
This has several benefits:
- Version control: changes can be tracked and rolled back in case of failure.
- Infrastructure as Code: the state of our infrastructure is captured in source files.
- Validation: every single change goes through code review, automated tests and static analysis tools to reduce the chance of defects.
- Blue/Green deployment: zero downtime & high availability of the monitoring system.
- Reuse: a production-like environment to reproduce issues.
- Disaster recovery: with Terraform, Packer, Ansible & Docker.
- Better monitoring: from physical to application layer.
- Error handling & anomaly detection.
- Quick deployment (11 seconds on average).
By the way, if any of these subject matters are of interest to you, InterCloud is quickly expanding our engineering team (engineers, architects, leads, etc.). If you’d like to work with InterCloud, drop us a message!