Problem Statement:
Monitoring of components in the cloud console is restricted to moderator users who have console access; end users, including warehouse users, do not. Warehouse and internal business users therefore have no insight into CPU percentages, RAM percentages, pod restarts, pod status, pod phases and so on until the DevOps/SRE team sends them a report. This applies not only to stateless pods but also to StatefulSet pods such as RabbitMQ and Elasticsearch. As a result, users learn that there is a hiccup in the system only when a pod goes down.
Solution Overview:
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open-source project and is maintained independently of any company. To emphasize this, and to clarify the project’s governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
Prometheus’s main features are:
- A multi-dimensional data model with time series data identified by metric name and key/value pairs
- PromQL, a flexible query language to leverage this dimensionality
- No reliance on distributed storage; single server nodes are autonomous
- Time series collection happens via a pull model over HTTP
- Pushing time series is supported via an intermediary gateway
- Targets are discovered via service discovery or static configuration
- Multiple modes of graphing and dashboarding support
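To make the multi-dimensional data model concrete, here is a minimal Python sketch that renders samples in the style of the Prometheus text exposition format, where a metric name is followed by its labels and a value. The metric name `pod_restarts_total`, the label values, and the helper `format_sample` are hypothetical and chosen only for illustration; escaping of special characters in label values is deliberately left out.

```python
def format_sample(name, labels, value):
    """Render one sample in the style of the Prometheus text exposition format."""
    if labels:
        # Labels become key="value" pairs; sorted here only for stable output.
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


# Hypothetical samples: one metric name, two time series told apart by labels.
samples = [
    ("pod_restarts_total", {"namespace": "warehouse", "pod": "rabbitmq-0"}, 3),
    ("pod_restarts_total", {"namespace": "warehouse", "pod": "elasticsearch-0"}, 0),
]

lines = [format_sample(name, labels, value) for name, labels, value in samples]
print("\n".join(lines))
```

Each distinct combination of metric name and label values is its own time series, which is what lets a query select or aggregate by namespace or by pod.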
Prometheus was deployed to expose pod metrics through Grafana, with Prometheus configured as the Grafana data source. The dashboards show the status of each pod as well as the CPU and RAM consumed by specific pods and their nodes, per namespace. If CPU or RAM usage crosses a stipulated threshold, Alertmanager sends an alert as a Slack notification. The Slack channels can be segregated by category, such as pods, databases and API endpoints, so that alerts do not all flood a single channel but appear in separate channels (for example Pods, DBs, API).
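The alerting flow described above can be sketched in a few lines of Python. This only mimics the decision logic; in Alertmanager itself, routing is expressed as a routing tree in its configuration file. The channel names, the `category` label, the sample readings, and the assumption that an alert fires when usage rises above the threshold are all hypothetical.

```python
# Hypothetical category-to-channel mapping; real channel names will differ.
CHANNELS = {"pod": "#alerts-pods", "db": "#alerts-dbs", "api": "#alerts-api"}
DEFAULT_CHANNEL = "#alerts-general"


def breaches(usage_percent, threshold_percent):
    """Assumption: an alert fires when usage rises above the threshold."""
    return usage_percent > threshold_percent


def route(alert):
    """Choose a Slack channel from the alert's 'category' label."""
    return CHANNELS.get(alert.get("category"), DEFAULT_CHANNEL)


# Hypothetical readings: (labels, usage %, threshold %).
readings = [
    ({"category": "pod", "name": "rabbitmq-0"}, 92.0, 80.0),
    ({"category": "db", "name": "orders-db"}, 40.0, 80.0),
    ({"category": "cache", "name": "redis-0"}, 95.0, 80.0),
]

# Keep only readings that breach their threshold, paired with a channel.
notifications = [
    (route(labels), labels["name"])
    for labels, usage, threshold in readings
    if breaches(usage, threshold)
]
print(notifications)
```

A fallback channel is included so that an alert with an unrecognized category is still delivered somewhere rather than silently dropped.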
Tech Stack leveraged:
- Alertmanager with Slack integration
- Grafana
- Node Exporter
- Metrics
- Prometheus server and operator
Benefits delivered:
- Metric collection: Prometheus collects and stores time-series data, allowing you to monitor various aspects of your applications, infrastructure, and services. It supports multi-dimensional data collection with labels, making it flexible and powerful.
- Scalability: Each Prometheus server is an autonomous single node rather than part of a horizontally scaling cluster. Larger and more dynamic environments are covered by deploying multiple Prometheus instances and federating them to centralize monitoring across different clusters or geographical locations.
- Alerting: Prometheus includes a built-in alerting system that allows you to define and manage alerts based on specific conditions or thresholds. This ensures that you are promptly notified of any issues or anomalies in your systems.
- Query Language: Prometheus comes with PromQL (Prometheus Query Language), a powerful query language that allows you to analyze and aggregate collected metrics. This enables detailed analysis and troubleshooting of performance issues.
- Service Discovery: Prometheus supports service discovery mechanisms, automatically picking up new service instances as they come online and dropping them when they go offline. This makes it well-suited for dynamic and containerized environments.
- Integrations: Prometheus has a rich ecosystem of integrations with various systems and tools, allowing you to pull in metrics from different sources. Common integrations include exporters for databases, cloud platforms, and other third-party services.
- Grafana Integration: Prometheus is often used in conjunction with Grafana, a popular open-source dashboard and visualization platform. The integration with Grafana provides a user-friendly interface for creating customizable dashboards and visualizing Prometheus metrics.
- Community support: Being open-source, Prometheus has a strong community that actively contributes to its development and provides support. This means you can benefit from a wealth of resources, including documentation, forums, and community-contributed integrations.
- Cloud Native support: Prometheus is well-suited for monitoring applications in cloud-native environments. Its support for dynamic service discovery and integration with container orchestration platforms like Kubernetes makes it a popular choice for DevOps teams.
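- Reliability and Durability: Prometheus has a simple and reliable storage model: each server writes to its own local time series database and does not depend on network storage or other remote services. The local storage engine is designed to handle high write and query loads efficiently. Because local storage is not replicated, long-term durability relies on backups or a remote storage integration.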
These benefits make Prometheus a valuable tool for organizations looking to implement robust monitoring and alerting solutions for their applications and infrastructure.
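As an illustration of the PromQL benefit listed above, the following Python sketch builds a request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The server address, the `warehouse` namespace value, the 5-minute rate window, and the helper `instant_query_url` are assumptions made for the example; `container_cpu_usage_seconds_total` is the CPU counter commonly exposed for containers in Kubernetes clusters, and it may be named or labeled differently in a given setup.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus instant-query endpoint /api/v1/query."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


# Assumptions: namespace label value and a 5-minute rate window.
# The query reads as: per-pod CPU usage rate, summed over each pod's containers.
PROMQL = (
    "sum by (pod) "
    '(rate(container_cpu_usage_seconds_total{namespace="warehouse"}[5m]))'
)

# Assumption: a Prometheus server reachable at localhost:9090.
url = instant_query_url("http://localhost:9090/", PROMQL)
print(url)
```

The sketch only constructs the URL and sends nothing over the network; the same PromQL expression could equally be pasted into a Grafana panel that uses Prometheus as its data source.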