I'm also seriously considering dropping Grafana for good for the same reasons stated in the post. Every year I need to rebuild a dashboard, reconfigure alerts, use the shiny new toy, etc etc. I'm tired.
I just want the thing to alert me when something's down, and ideally if the check doesn't change and the datasource and metric don't change, the dashboard definition and the alert definition should be the same for the last and the next 10 years.
The UI used to have the 4-5 most important links in the sidebar; now it's 10 menus with submenus of submenus, and I never know where to find the basics: Dashboards and Alerts. When something goes off, I don't have time to re-learn a UI I look at maybe once a month.
Building an elaborate pile of technical debt is a great way to have an elaborate pile of technical debt, but with service lifespans of 2-3 years it gets painful: compose a stack out of enough products and every quarter you need to replace something big.
I don't know why software developers feel the urge to stay on the bleeding edge of every product and update their setup every week, then turn around and complain "this stuff isn't stable!"
I've had a Grafana + Prometheus setup on my servers since like 2017. It worked then and it works today. I log in maybe once every year or two to update to a newer LTS version. Every dashboard is still pristine, and nothing has ever broken.
I don't understand most of the words in the linked post and don't need to. The core package is the boring solution that 99% of people here need, and that works great.
> Every dashboard is still pristine
How did you handle the Angular deprecation in Grafana? Or are you just staying on an older version that still supports it?
All of my dashboards were auto updated so I didn't really have to do anything.
I've done it and it wasn't too bad? The auto migrate worked pretty well for 99% of stuff (hundreds of dashboards).
Though I won't say I loved doing it.
Mimir is just architected for a totally different order of magnitude of metrics. At that scale, yeah, Kafka is actually necessary. There are no other open-source solutions offering the same scalability, period.
That's beside the point that most customers will never need that level of scale. If you're not running Mimir on a dedicated Kubernetes cluster (or at least a cluster dedicated to Grafana/observability), then it's probably over-engineered for your use case. Just use Prometheus.
Have a look at VictoriaMetrics - I've run it at relatively high scale with much more success than any other metric store. It's one of those things that just work. It's extremely easy to run in single-instance mode and handles much more than you would expect. Scaling it is a breeze too.
(I'm not affiliated, but a very happy user across multiple orgs and personal projects)
The project where I looked at Mimir was a 500+ million timeseries project, with the desire to support scaling to the ten-figure level of timeseries (working for a BigCo supporting hundreds of product development teams).
All of these systems that store metrics in object storage - you have to remember that object storage is not file storage. Generally speaking (stuff like S3 Express One Zone being a relatively recent exception), you cannot append to objects. Metrics queries are resolved by querying historical metrics in object storage plus a stateful service hosting the latest 2 hours of data, before that data can be compressed and uploaded to object storage as a single block. At a certain scale, you simply need to choose which is more important - being able to answer queries or being able to insert more timeseries. If you don't prioritize insertion, the backlog just gets bigger and bigger, which - especially in the eventual case (Murphy's Law guarantees it) of a sudden flood of metrics to ingest - causes several-hour ingestion delays during which you are blind. And if you do prioritize insertion, well, the component simply won't respond to queries, which makes you blind anyway. Lose-lose.
Mimir built Kafka in because it's quite literally necessary at scale. You need the stateful query component (with the latest 2 hours) to prioritize queries, then pull from the Kafka topic on a lower-priority thread, when there's spare time to do so. Kafka soaks up the sudden ingestion floods so that they don't result in the stateful query component getting DoS'd.
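A generic sketch of that buffering pattern, assuming the kafka-python client; this illustrates the idea only, not Mimir's actual ingest path, and the broker, topic, and helper names are placeholders:

```python
# Generic illustration of log-buffered ingestion, not Mimir internals.
# Assumes the kafka-python package; broker, topic, and helper names are placeholders.
import json
import time
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]   # placeholder
TOPIC = "metric-samples"       # placeholder

# Ingestion edge: accept floods by appending them to the durable log and returning.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode(),
)

def ingest(sample: dict) -> None:
    producer.send(TOPIC, sample)   # returns quickly; Kafka absorbs the burst

# Stateful query component: drain the log only within a small time budget,
# so answering queries keeps priority over catching up on the backlog.
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKERS, group_id="ingester")

def drain_backlog(budget_ms: int = 50) -> None:
    deadline = time.monotonic() + budget_ms / 1000
    while time.monotonic() < deadline:
        for records in consumer.poll(timeout_ms=10, max_records=500).values():
            for record in records:
                store_in_memory(json.loads(record.value))  # hypothetical helper

def store_in_memory(sample: dict) -> None:
    ...  # stand-in for appending to the in-memory head block
```

The point is only that the write path never blocks on the stateful component: a flood lands in the log and gets worked off later.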
I took a quick look at VictoriaMetrics - no Kafka or Kafka-like component to soak up ingestion floods? DOA.
Again, most companies are not BigCos. If you're a startup/scaleup with one VP supervising several development teams, you likely don't need that scale, and VictoriaMetrics is probably just fine - you're not the first person I've heard recommend it. But I would say 80% of companies are small enough to be served by a simple Prometheus or Thanos Query over HA Prometheus setup, 17% will get a lot of value out of VictoriaMetrics, and the last 3% really need Mimir's scalability.
I'm not sure where you saw that VictoriaMetrics uses object storage. It doesn't - it uses block storage, and it runs completely fine on HDD; you don't even need SSD/NVMe.
There are multiple ways to deal with ingestion floods. Kafka/distributed log is one of them, but it's not the only one. In cluster mode VM is a distributed set of services that scale out independently and buffer at different levels.
Resource usage for ingestion/storage is much lower than other solutions, and you get more for your money. At $PREVIOUS_JOB, we migrated from a very expensive Thanos to a VM cluster backed by HDDs, and saved a lot. Performance was much better as well. It was a while ago, and I don't remember the exact number of time series, but it was meant to handle 10k+ VMs (and a lot of other resources, multiple k8s clusters) and did it with ease (also for everybody involved).
I don't think you have really looked into VM - you might get pleasantly surprised by what you find :) Check out this benchmark with Mimir[1] (it is a few years old though), and some case studies [2]. Some of the companies in the case studies run at significantly higher volume than your requirements.
[1] https://victoriametrics.com/blog/mimir-benchmark/
[2] https://docs.victoriametrics.com/victoriametrics/casestudies...
There were other problems with VictoriaMetrics - a failed migration attempt by previous engineers made it politically difficult to raise as a possibility, lack of a promise of full PromQL compatibility (too many PromQL dashboards built by too many teams), seeing features locked behind the Enterprise version (Mimir Enterprise had features added on top, not features locked away).
> HDD
You're right, I'm misremembering here, that particular complaint about a lack of Kafka was a Thanos issue, not VM.
That said, HDD is a hard sell to management. Seen as "not cloud native". People with old trauma from 100% full disks not expanded in time. Organizational perception that object storage does not need to be backed up (because redundancy is built into the object storage system) but HDD does (and automated backups are a VM Enterprise feature, and even more important if storing long-term metrics in VM).
> In cluster mode VM is a distributed set of services that scale out independently and buffer at different levels
So are Thanos and Mimir, which suffer from ingest floods causing DoS, at least until Kafka was added. vminsert is billed as stateless, same as Thanos Receiver, same as Mimir Distributor. Not convinced.
In the back of my head there’s always the thought of dropping availability once we start discussing mutually exclusive operations.
What's your preferred solution for observability and monitoring of tiny apps?
I'm looking for something with really compact storage, really simple deployment (preferably a single statically linked binary that does everything), and compatible with OpenTelemetry (including metrics and distributed tracing). If/when I outgrow it, I can switch to another OpenTelemetry provider (but realistically this will not happen)
I'm personally not convinced OpenTelemetry is the future. I get the desire to not be vendor-locked to a single provider, but Prometheus and Jaeger are very solid, battle-hardened, popular, well-maintained, easily self-hosted open-source projects. For small deployments you do not need to overthink things here - Grafana, Prometheus, Jaeger (with local disk storage). Logging depends on how many machines you're talking about and where they're hosted (e.g. GCP Cloud Logging is fine for GCP-hosted projects; the 50 GB free tier is a lot for a small project), but as a default Loki is also just fine and much better than Elastic/OpenSearch.
OpenTelemetry is, last I looked at it, way too immature, unstable, and resource-hungry to be such a foundational part of infrastructure.
> Grafana, Prometheus, Jaeger
This is a lot of infrastructure; we are talking about a tiny app here. Are you sure this is warranted?
Honestly I would prefer to have observability as a library, but that's not feasible because of two factors: a) I really want distributed tracing (no microservices - I just want to combine traces from frontend and backend), so I need a place to join them, and b) it could/would lead to loss of traces when the program crashes.
In any case, it makes sense for me to choose tracing and metrics libraries that can output either OpenTelemetry or Prometheus and Jaeger, in the event that OpenTelemetry is not enough.
> Loki is also just fine and much better than Elastic/OpenSearch.
Wait, there is more?
> combine traces from frontend and backend
I'm scratching my head a little bit on what your expectation is here. Traces and real-user-monitoring are not the same thing here. Distributed tracing is specifically a microservices thing. Maybe all you're looking for is to just attach a UUID to each request and log it? Jaeger and Tempo aren't going to help you with frontend code.
> A lot of infrastructure
> Prometheus
You need something to tell you when your tiny app isn't running, so it can't be a library embedded into the app itself.
> Grafana
You need something with dashboards to help you understand what's going on. If your thing telling you your app has crashed is outside your app, the thing that helps visualize what happened before your app crashed also needs to be outside your app.
> Jaeger
Do you really need traces? Or just to log how long requests took, and have metrics for p50/p95/p99? (See the sketch after this comment.)
> Loki
If you're running only one instance of your app, you don't need it, just read through your logfiles with hl. If you have multiple machines, sending your logs to one place may not be necessary, but it's incredibly helpful; the alternative is basically ssh multiplexing...
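On the p50/p95/p99 point above: a minimal sketch using the prometheus_client Python package; the metric name, buckets, and port are arbitrary choices, and the quantiles are computed at query time in PromQL rather than in the app.

```python
# Minimal request-latency histogram, assuming the prometheus_client package.
# Metric name, buckets, and port are arbitrary choices for illustration.
import random
import time
from prometheus_client import Histogram, start_http_server

REQUEST_SECONDS = Histogram(
    "http_request_duration_seconds",
    "Time spent handling a request",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5),
)

def handle_request() -> None:
    with REQUEST_SECONDS.time():               # observes elapsed time on exit
        time.sleep(random.uniform(0.01, 0.3))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)                    # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()

# Quantiles then come out of PromQL, e.g. p95:
#   histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```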
OpenObserve might be what you're looking for
Thank you for that. I absolutely love that this uses tantivy.
I was previously leaning towards VictoriaMetrics and VictoriaTraces (I will need both), but I think that OpenObserve is even simpler. Later I found Gigapipe/qryn https://github.com/metrico/gigapipe
Does OpenObserve ship something to view traces and metrics? (It appears that Gigapipe does.) Or am I supposed to just use Grafana? I want to cut down on moving pieces.
OpenObserve has logs, metrics, traces, dashboards, RUM and alerts
> compatible with OpenTelemetry
Isn't OpenTelemetry very slow?
What do you suggest instead?
I'm looking at OpenTelemetry because of broad tooling compatibility (both Rust tracing crates, tracing and emit, support it - for logs, tracing and metrics) and it seems like something that will stick around.
Also I'm not sure I will ever need actual performance out of an observability solution; it's a tiny app after all.
What's the most promising alternative to Prometheus/Grafana if you're developing a new solution around OTEL? If you could start today and pick tools, what would you go for?
We also started with the typical kube-prometheus-stack, but we don't like Prometheus/PromQL. Moreover, it only solves the "metrics" part - to handle logs and traces, additional, quite heavy and complex components have to be added to the observability stack.
This didn't feel right, so we looked around and found GreptimeDB https://github.com/GreptimeTeam/greptimedb, which simplifies the whole stack. It's designed to handle metrics, logs, and traces. We collect metrics and logs via OpenTelemetry and visualize them with Grafana. It provides endpoints for Postgres, MySQL, and PromQL; we're happy to be able to build dashboards using SQL, as that's where we have the most knowledge (see the sketch after this comment).
The benchmarks look promising, but our k8s clusters aren't huge anyway. As platform engineers, we appreciate the simplicity of our observability stack.
Any other happy greptimedb users around here? Together with OTel, we think we can handle all future obs needs.
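For anyone wondering what "dashboards using SQL" looks like in practice, here is a minimal sketch querying GreptimeDB over its PostgreSQL-compatible endpoint with psycopg2; the port, database, credentials, table, and column names are assumptions to check against your own deployment and the GreptimeDB docs.

```python
# Minimal sketch: querying GreptimeDB through its PostgreSQL-compatible endpoint.
# Assumes the psycopg2 package; port, database, user, table, and column names
# are placeholders - verify them against your deployment and the GreptimeDB docs.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4003, dbname="public", user="greptime")
with conn, conn.cursor() as cur:
    # Hypothetical table written by the OTel pipeline: per-pod CPU samples.
    cur.execute(
        """
        SELECT date_trunc('minute', ts) AS minute,
               pod,
               avg(cpu_usage) AS avg_cpu
        FROM pod_metrics
        WHERE ts > now() - INTERVAL '1 hour'
        GROUP BY minute, pod
        ORDER BY minute
        """
    )
    for minute, pod, avg_cpu in cur.fetchall():
        print(minute, pod, avg_cpu)
```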
Hi there, I’m from the GreptimeDB team.
Thank you for giving GreptimeDB a shout-out—it means a lot to us. We created GreptimeDB to simplify the observability data stack with an all-in-one database, and we’re glad to hear it’s been helpful.
OpenTelemetry-native is a requirement, not an option, for the new observability data stack. I believe otel-arrow (https://github.com/open-telemetry/otel-arrow) has strong future potential, and we are committed to supporting and improving it.
FYI: I think SQL is great for building everything—dashboards, alerting rules, and complex analytics—but PromQL still has unique value in the Prometheus ecosystem. To be transparent, GreptimeDB still has some performance issues with PromQL, which we’ll address before the 1.0 GA.
Check out OpenObserve https://github.com/openobserve/openobserve. It was built precisely to solve the challenges around Grafana and Elastic. This is not a stack that you will need to weave together - just a single binary/container that will suffice for most users' needs: logs, metrics, traces, dashboards, alerts.
Disclosure: I am a maintainer of OpenObserve
Having been using Grafana community/cloud for a number of years, my new gig is currently moving everything to SigNoz. Mostly slick, under active development, communicative team, open source... what's not to love?
you can check out: https://github.com/SigNoz/signoz
open source and opentelemetry-native. Lots of our users have migrated from grafana to overcome challenges like having to handle multiple backends.
p.s - i am one of the maintainers.
Not sure what's an alternative for Grafana in the open source world in terms of building dashboards for o11y? I'm not aware of one and Grafana is used very extensively in my company...
I mentioned it in another reply, but https://perses.dev/ is probably the most promising alternative.
Besides that, if you're feeling masochistic you could use Prometheus' console templates or VictoriaMetrics' built-in dashboards.
Though these are all obviously nowhere near as feature-rich and capable as Grafana, and would only be able to display metrics for the single Prom/VM node they're running on. Might be enough for some users.
How come I’ve never heard of Perses yet? A really Open Source, standardising Grafana clone to go alongside Prometheus for self-hosted deployments sounds just perfect!
For a hosted alternative, Dash0 also uses Perses for their dashboards.
Disclaimer: I am affiliated with them.
From a cursory reading of the article, I don't see that the author's problems are specifically with Grafana in its best use case (metrics), but rather with other products from the Grafana company, for which there are a lot of alternatives.
Grafana dashboarding itself (paired with VictoriaMetrics and occasionally ClickHouse) is one of the most pleasant web apps IMO. Especially when you don't try to push the constraints of its display model, which are sometimes annoying but understandable.
I remember that alternative free/FOSS products existed before Grafana (c. 2015), but many died and Grafana ended up everywhere. Now I also cannot find the old alternatives. Vague memories of RRD and Nagios...
Munin was what we used for a while, along with a smattering of smokeping.
We're using a combination of Zabbix (alerting) and local Grafana/Prometheus/Loki (observability) at this point, but I've been worried about when Grafana will rug-pull for a while now. Hopefully enough people using their cloud offering sates their appetite and they leave the people running locally alone.
I ran with Centreon for a while because you got Nagios + integrated dashboarding out of the box and a Community option.
I'm out of that game now though so don't have the challenge.
https://www.centreon.com/
https://github.com/opensearch-project/OpenSearch-Dashboards (Kibana fork) is one. But Grafana is still way better if you just stay away from anything that isn't the core product: data visualization and exploration (explorer and traces).
We're using Graylog + Elasticsearch, which would totally replace a Loki-only stack.
You can check out: https://github.com/SigNoz/signoz
We're OpenTelemetry-native, and apart from many out-of-the-box charts for APM, infra monitoring, and logs, you can also build customized dashboards with lots of visualization options.
p.s - i am one of the maintainers
I use SigNoz for my private purposes. It's not a 100% match, but you can do Prometheus metrics, log analysis, dashboards, alerts, and OTEL spans, so depending on your use case it can be enough.
I moved away from Grafana to Axiom and have not looked back
... ugh, they actually made an `o11[a-z]` abbreviation? When I picked this nick, the only term I ever saw in the wild was `i18n`.
K8s (Kubernetes), a11y (accessibility)...
The kicker for me recently was hearing someone say "ally"
a16z, l10n, s11n,
Or without numbers,
authC/authN, authZ...
I work in this area and even I don’t know what AuthC is!
They are absurd abbreviations. The first distinguishing letter comes right after auth, so ... let's hide it?
AuthN is supposed to mean authentication.
The problem is that authorization also has an "n" in the word.
Enter authC.
i d2t s1e t1e p5m...
o11y is not a word. What do you mean?
Observability, in the vein of accessibility, which has the silly nickname of a11y.
This is all coming from stenography; it's a well-established shorthand for long words: first letter, count of middle letters, last letter.
No, it came from DEC https://en.wikipedia.org/wiki/Numeronym#Numerical_contractio...
It seems unlikely to me that stenography would use this style because they have better ways of abbreviating long agglutinative words.
Which is similar to i18n for internationalization and l10n for localization.
> in the vein of accessibility which has the silly nickname of a11y
ironically that's not very accessible...
IMHO, in a way the constant churn of Grafana made it easier to live with it. When they break your dashboards every major release, you just learn to let go. And by break I don't mean just making it error and not work - it's the constant moving of things around, refactoring the UI, replacing one component with another, all accompanied by a number of glitches every time they rewrite things. You just accept and ignore it eventually.
What's a bigger lock-in for me is metrics and PromQL - you just can't ever rename a poorly named metric or you face a world of pain. Or when Prometheus releases Native Histograms to replace the old ones, and suddenly everything from rules, alerts, and ad-hoc queries to dashboards needs updating.
And PromQL is so opaque, it just never gives you an error unless there is a syntax issue. We need tools like https://github.com/cloudflare/pint just to know if my alert description isn't trying to render a label that's just not gonna be there, etc.
> And PromQL is so opaque, it just never gives you an error unless there is a syntax issue. We need tools like https://github.com/cloudflare/pint just to know if my alert description isn't trying to render a label that's just not gonna be there, etc.
PromQL should be blamed on Prometheus though, not on Grafana.
I frequently use a docker-compose template with Prometheus pushgateway + Grafana for deploying on single-node servers, as described at the start of the article. It works well and is trivial to set up, but the complexity explodes once your metric volume or cardinality requires more scale and you reach for Prometheus alternatives a la Mimir.
I think this would not need to be an issue as frequently if Prometheus had a more efficient publish/scraping mechanism. IIRC there was once a protobuf metric format that was dropped, and now there is just the text format. While it wouldn't handle billions of unique labels like Mimir, a compact binary metric format could certainly allow for millions at reasonable resolution instead of wasting all that scale potential on repeated name strings. I should be able to push or expose a bulk blob all at once with ordered labels, or at least raw int keys.
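To make the "repeated name strings" point concrete, here is roughly what the text exposition format looks like; a small sketch assuming the prometheus_client package, with made-up label values (the actual output may include extra series such as _created):

```python
# Sketch of the text exposition format's verbosity, assuming prometheus_client.
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()
requests = Counter(
    "http_requests", "Total HTTP requests",
    ["method", "path", "status"], registry=registry,
)
# A few made-up samples; the metric name and every label name are repeated
# as strings on each line of the scrape payload.
requests.labels("GET", "/api/users", "200").inc(42)
requests.labels("GET", "/api/users", "500").inc(3)
requests.labels("POST", "/api/orders", "201").inc(17)

print(generate_latest(registry).decode())
# Output is roughly:
#   # HELP http_requests_total Total HTTP requests
#   # TYPE http_requests_total counter
#   http_requests_total{method="GET",path="/api/users",status="200"} 42.0
#   http_requests_total{method="GET",path="/api/users",status="500"} 3.0
#   http_requests_total{method="POST",path="/api/orders",status="201"} 17.0
```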
> "But I got it all working; now I can finally stop explaining to my boss why we need to re-structure the monitoring stack every year."
Prometheus and Grafana have been progressing in their own ways, each of them trying to have a full-stack solution - and then the OTEL thingy came and ruined the party for everyone.
I still haven't got my head around how OTEL fits into a good open-source monitoring stack. Afaik, it is a protocol for metrics, traces, and logs. And we want our open-source monitoring services/dbs to support it, so they become pluggable. But, afaik, there's no one good DB for logs and metrics, so most of us use Prometheus for metrics and OpenSearch for logs.
Does OTEL mean we just need to replace all our collectors (like logstash for logs and all the native metrics collectors and pushgateway crap) and then reconfigure Prometheus and OpenSearch?
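For what it's worth, the app-facing side of OTel is just an SDK (or a collector in front of the apps) emitting OTLP to whichever endpoint you configure, which is where the "pluggable" part comes from. A minimal sketch, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp Python packages; the endpoint and names are placeholders:

```python
# Minimal OTLP trace export sketch, assuming the opentelemetry-sdk and
# opentelemetry-exporter-otlp packages; endpoint and names are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "tiny-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("tiny-app")
with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("http.route", "/api/users")
    # ... the actual work happens here; any OTLP-speaking backend can receive it ...
```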
Logs, spans and metrics are stored as time-stamped stuff. Sure, simple fixed-width columnar storage is faster, and it makes sense to special-case numbers (add downsampling and aggregations, and histogram maintenance and whatnot), but any write-optimized storage engine can handle this; it's not the hard part (basically LevelDB, and if there's need for scaling out it'll look like Cassandra, Aerospike, ScyllaDB, or ClickHouse ... see also https://docs.greptime.com/user-guide/concepts/data-model/ and specialized storage engines https://docs.greptime.com/reference/about-greptimedb-engines... )
I think the answer is that it doesn't fit in any definition of a _good_ monitoring stack, but we are stuck with it. It has largely become the blessed protocol, specification, and standard for OSS monitoring, along every axis (logging, tracing, collecting, instrumentation, etc)... it's a bit like the efforts that resulted in J2EE and EJBs back in the day, only more diffuse and with more varied implementations.
And we don't really have a simpler alternative in sight... at least in the Java days there was the disgust and reaction via Struts, Spring, EJB3+, and of course other languages and communities.
Not sure how exactly we got into such an over-engineered monoculture in terms of operations, monitoring, and deployment for 80%+ of the industry (k8s + graf/loki/tempo + endless supporting tools or flavors), but it is really a sad state.
Then you have endless implementations handling bits and pieces of various parts of the spec, and of course you have the tools to actually ingest and analyze and report on them.
SigNoz is good and under active development: https://github.com/SigNoz/signoz
I second SigNoz. I was paying a fortune for a cloud observability platform that cost more and more every month. Then I switched to self-hosted SigNoz on a cheap Hetzner box and now my observability stack costs $10 a month.
> I want stability for my monitoring; I want it boring, and that’s something Grafana is not offering
I used to be a fan of InfluxDB (back in the days of v1.x) then I went off it for exactly this reason.
But are there good alternatives to Grafana in the FOSS space nowadays?
I only know of https://perses.dev/ but haven't had a look at it for ~half a year. It was very barebones back then but I'm hopeful it can replace Grafana for at least basic dashboarding soon.
This is pretty interesting to me, as I do use Grafana in my current role. But none of their other products, and not their helm chart (we're on the Bitnami chart if that's a thing).
So far it's pretty good. We're at least one major version behind, but hey everything still works.
I cannot imagine other products support as many data sources (though I'm starting to think they all suck, I just dump what I can in InfluxDB).
I agree. I think OP has made the mistake of using more than just Grafana for dashboards and perhaps user queries.
I operate a fairly large custom VictoriaMetrics-based observability platform and learned early on to only use Grafana itself, as opposed to the other Grafana products. Part of the stack used to use Mimir's frontend as a caching layer, but even that died with Mimir v3.0, now that it can't talk to generic Prometheus APIs anymore (vanilla Prom, VictoriaMetrics, promxy etc.). I went back to Cortex for caching.
Such a custom stack is obviously not for everyone and takes much more time, knowledge and effort to deploy than some helm chart but overall I'd say it did save me some headache. At least when compared to the Google-like deprecation culture Grafana seems to have.
I have found Grafana to be a decent product, but Prom needs a better horizontally scalable solution. We use Vector and ClickHouse for logging and it works really well.
Off topic, but the Prometheus pushgateway is such a bad implementation (once you push metrics, they stay there until it's restarted; e.g. a counter does not increase, a new push just replaces the metric with the new value) that we had to write our own metrics collector endpoint.
That is literally how it is supposed to work. Prometheus grabs metrics --- that is how it works. If you for some reason find yourself unable to host an endpoint with metrics, you can use the fallback pushgateway to push metrics where yes they will stay until restarted. Ask yourself how it could ever work if they are subsequently deleted after read. How would multiple prometheus agents be able to read from the same source?
It sounds like you are using it for the wrong job. It’s supposed to be a solution for jobs / short running processes that don’t expose a /metrics endpoint for Prometheus long enough to be scraped and there you exactly want that kind of behavior.
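For that intended use case - a short-lived batch job pushing its result once - a minimal sketch with the prometheus_client package; the gateway address, job name, and metric names are placeholders:

```python
# Minimal pushgateway sketch for a batch job, assuming prometheus_client.
# Gateway address, job name, and metric names are placeholders.
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "batch_job_last_success_timestamp_seconds",
    "Unixtime the batch job last finished successfully",
    registry=registry,
)
duration = Gauge(
    "batch_job_duration_seconds", "How long the last run took", registry=registry
)

start = time.time()
# ... the actual batch work would happen here ...
duration.set(time.time() - start)
last_success.set_to_current_time()

# The pushgateway holds these values until the next push, so Prometheus can
# scrape them long after the job has exited.
push_to_gateway("localhost:9091", job="nightly-batch", registry=registry)
```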
The pushgateway is itself a horrible hack around the fact that Prometheus is designed only for metrics scraping. Unfortunately the whole ecosystem around it is an utter mess.
Remote Write is a viable alternative in Prometheus and its drop-in replacements. I'm not a massive fan of it myself, as I feel the pull-based approach is superior overall, but I still make heavy use of it.
The pushgateway's documentation itself calls out that there are only very limited circumstances where it makes sense.
I personally only used it in $old_job and only for batch jobs that could not use the node_exporter's textfile collector. I would not use it again and would even advise against it.
Sounds like Grafana needed to fork.
If your observability stack works and you are fine with it, do you need to update it?
I understand updating some front-facing service due to a vulnerability... But for a thing that's internally accessible?
> Mimir in version 3.0 needs Apache Kafka to work.
I’d like to adjust this understanding. Kafka is the big new thing, but it’s optional. The previous way using gRPC still works.
I work on Mimir and other things at Grafana Labs.
Well the docs say the old way (no Kafka) is on its way out:
"However, this architecture is set to be deprecated in a future release."
So it doesn't stay optional, unfortunately. It's quite a heavy dependency to include...
Author here. I know the old way still works and I respect that. Given the history, I ask myself how long it will work, since it's not the default anymore.
FTA > "I know for a fact that that pace is partially driven by career-driven development."
This isn't a Grafana problem, this is an industry-wide problem. Resume-driven product design, resume-driven engineering, resume-driven marketing. Do your 2-3 years, pump out something big to inflate your resume. Apply elsewhere to get the pay bump that almost no company is handing out. After the departures there is no one left who knows the system, and the next people in want to replace the things they don't understand to pad their resume for the next job.
Wash, rinse, repeat.
Loyalty simply goes unrewarded in a lot of places in our industry (and at many corporations). And the people who do stay... in many cases they turn into furniture that ends up holding potentially good evolution back. They lose out to the technological magpies who bring shiny things to management because it will "move the needle".
Sadly this is just one facet of the problems we are facing, from how we interview to how we run (or rent) our infrastructure things have gotten rather silly...
without any stability, you really can’t blame the player for playing this game.
The days where you could devote your career to a firm and retire with a pension are long gone
The author of this article wants a boring tech stack that just works, and honestly after everything we’ve been through in the last five years, I kinda want a boring job I can keep until I retire, too
Who remembers Graphite and Carbon? This was 2010 era…
This reads like a satire. There's so much jargon and so many products involved to just do a little bit of logging. It's ridiculous.
That is to say I agree with the author.
What are tested and fairly lightweight alternatives to Loki?
The Elastic stack is so heavy it's out of the question for smaller clusters. Loki integration with Grafana is nice to have, but a separate capable dashboard would also be fine.
I find the color of the text so light as to be unreadable.
The font doesn't render correctly on my device. It seems as if some strokes are doubled, making lines inconsistent.
It’s challenging to read for me too.
the "dark" theme having a *blaring* white banner at the top is an ... interesting design choice
This is what keeps me from using things like these. Meanwhile I have unix-oid scripts that keep chugging along on up-to-date servers for more than a decade without any need to change them.
I know there is always the temptation to make it really shiny and nice. But the more moving parts your system has, the likelier it becomes that something will fail eventually. And as it happens, these failures usually occur at the time when it is most inconvenient for all people involved.
That doesn't mean that complex software cannot work reliably, but it takes more effort on the developer side to honor that unwritten contract with their users (if they are even aware of it).
This is why sometimes doing it yourself, on your own servers, can be beneficial, because it gives you more control.
This article comes off as sort of low effort and mentions a lot of grievances without actually pinpointing precise issues. I think leveraging OTEL as a general processor with a generic output is a good idea, but discounting Grafana for implementing multi-tenancy solutions and Alloy, which is pretty fucking good, is kind of pointless.
I feel this article in my bones. It's rough out there.
It’s as though they deliberately make it complex and constantly moving to wear you out to give in to the convenient SaaS offer.
As someone who runs SaaS products, this post resonates painfully well.
The author is 100% correct: Monitoring should be the most boring tool in the stack. Its one and only job is to be more reliable than the thing it's monitoring.
The moment your monitoring stack requires a complex dependency like Kafka, or changes its entire agent flow every 18 months, it has failed its primary purpose. It has become the problem.
This sounds less like a technical evolution and more like the classic VC-funded push to get everyone onto a high-margin cloud product, even at the cost of the open-source soul.