This article seems to indicate that manually triggered failovers will always fail if your application tries to maintain its normal write traffic during that process.
Not that I'm discounting the author's experience, but something doesn't quite add up:
- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
- If they know, how is this not an urgent P0 issue for AWS? It seems like one of the most basic usability features is completely broken.
- Is there something more nuanced to the failure case here, such as a dependency on in-progress transactions? I can see how the failover might wait for in-flight transactions to close, hit a timeout, and then accidentally proceed with the rest of the failover anyway. That could explain why the issue doesn't seem more widespread.
> How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through"
An experience not exclusive to cloud vendors :) Even better when the vendor throws their hands up because the issue isn't reliably reproducible.
That was when I scripted a test that ran hundreds of times a day in a lower environment, attempting a repro. As they say, at scale even insignificant issues become significant. I don't remember exactly, but I think there was a 5-10% chance of the issue triggering on any given run.
At least confirming the fix, which we did eventually receive, was mostly a breeze. We had to provide an inordinate amount of captures, logs, and data to get there, though. It was quite the grueling few weeks, especially all the office-politics-laden calls.
I've had customers sit on load-related bugs for years simply because they'd reboot when the problem happened. When dealing with the F100, it seems there is a rather limited number of people in these organizations who can troubleshoot complex issues; that, or they lock them away out of sight.
It is a tough bargain to be fair, and it is seen in other places too. From developers copying out their stuff from their local git repo, recloning from remote, then pasting their stuff back, all the way to phone repair just meaning "here's a new device, we synced all your data across for you", it's fairly hard to argue with the economic factors and the effectiveness of this approach at play.
With all the enterprise solutions being distributed, loosely coupled, self-healing, redundant, and fault-tolerant, issues like this essentially just slot in perfectly. Compound this with man-hours (especially expert ones) being a lot harder to justify for any one particular bump in tail latency, and the equation is just really not there for all this.
What gets us specifically to look into things is either the issue being operationally gnarly (e.g. frequent, impacting, or both), or management being swayed enough by principled thinking (or at least pretending to be). I'd imagine it's the same elsewhere. The latter would mostly happen if fixing a given thing becomes an office political concern, or a corporate reputation one. You might wonder if those individual issues ever snowballed into a big one, but turns out human nature takes care of that just "sufficiently enough" before it would manifest "too severely". [0]
Otherwise, you're looking at fixing / RCA'ing / working around someone else's product defect on their behalf, and giving your engineers a "fun challenge". Fun doesn't pay the bills, and we rarely saw much in return from the vendor in exchange for our research. I'd love to entertain the idea that maybe behind closed doors the negotiations went a little better because of these, but for various reasons, I really doubt so in hindsight.
[0] as delightfully subjective as those get of course
If I had a nickel for every time I had to explain that rebooting a database server is usually the wrong choice I would have quite a fortune.
Theoretically you're supposed to assign lower priority to issues with known workarounds, but then there should also be reporting for product management (which assigns weight by age of first occurrence and total count of similar issues).
Amazon is mature enough for processes to reflect this, so my guess for why something like this could slip through is either too many new feature requests or many more critical issues to resolve.
Azure, yes: I'd expect this, and the restart would take many minutes. Been there, done that.
With AWS, this is surprising.
I'm surprised this hasn't come up more often too. When we worked with AWS on this, they confirmed there was nothing unique about our traffic pattern that would trigger this issue. We also didn't run into this race condition in any of our other regions running similar workloads. What's particularly concerning is that this seems to be a fundamental flaw in Aurora's failover mechanism that could theoretically affect anyone doing manual failover.
> - How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
I know that there is no comparison in the user base, but a few years ago I ran into a massive Python + MySQL bug that:
1. made SELECT ... FOR UPDATE fail silently
2. aborted the transaction and set the connection into autocommit mode
This is basically a worst-case scenario in a transactional system.
I was basically screaming like a mad man in the corner but no one seemed to care.
Someone contacted me months later telling me that they experienced the same problem with "interesting" consequences in their system.
The bug was eventually fixed but at that point I wasn't tracking it anymore, I provided a patch when I created the issue and moved on.
https://stackoverflow.com/questions/945482/why-doesnt-anyone...
Converting a connection to autocommit upon error. Yikes!!
If I'm reading this correctly, it sounds like the connection was already using autocommit by default? In that situation, if you initiate a transaction, and then it gets rolled back, you're back in autocommit unless/until you initiate another transaction.
If so, that part is all totally normal and expected. It's just that due to a bug in the Python client library (16 years ago), the rollback was happening silently because the error was not surfaced properly by the client library.
I would argue that it's a bug for it even to be possible to autocommit.
What do you mean? Autocommit mode is the default mode in Postgres and MS SQL Server as well. This is by no means a MySQL-specific behavior!
When you're in autocommit mode, BEGIN starts an explicit transaction, but after that transaction (either COMMIT or ROLLBACK), you return to autocommit mode.
The situation being described upthread is a case where a transaction was started, and then rolled back by the server due to deadlock error. So it's totally normal that you're back in autocommit mode after the rollback. Most DBMS handle this identically.
The bug described was entirely in the client library failing to surface the deadlock error. There's simply no autocommit-related bug as it was described.
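To make the mechanics concrete, here's a minimal psycopg2 sketch of that autocommit behavior (the connection string and table are placeholders, not from this thread):

```python
# Minimal sketch of autocommit semantics: an explicit BEGIN opens a transaction,
# and after ROLLBACK the session is back in autocommit mode.
import psycopg2

conn = psycopg2.connect("dbname=demo user=app")  # placeholder DSN
conn.autocommit = True          # session-level autocommit, the usual default mode

cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS t (v int)")

cur.execute("BEGIN")            # explicit transaction starts even in autocommit mode
cur.execute("INSERT INTO t VALUES (1)")
cur.execute("ROLLBACK")         # transaction ends; the row above is gone...

cur.execute("INSERT INTO t VALUES (2)")  # ...and this commits immediately,
                                         # because the session is back in autocommit
```

The danger in the bug described above was not this behavior itself, but that the rollback happened without the caller ever seeing the error, so subsequent statements were silently committed one by one.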
Yes, and most DBMS's are full of historical mistakes.
In a sane world, statements outside `BEGIN` would be an unconditional error.
Lack of autocommit would be bad for performance at scale, since it would add latency to every single query. And the MVCC implications are non-trivial, especially for interactive queries (human taking their time typing) while using REPEATABLE READ isolation or stronger... every interactive query would effectively disrupt purge/vacuum until the user commits. And as the sibling comment noted, that would be quite harmful if the user completely forgets to commit, which is common.
In any case, that's a subjective opinion on database design, not a bug. Anyway it's fairly tangential to the client library bug described up-thread.
Autocommit mode is pretty handy for ad-hoc queries at least. You wouldn't want to have to remember to close the transaction since keeping a transaction open is often really bad for the DB
My experience with AWS is that they are extremely, extremely parsimonious about any information they give out. It is near-impossible to get them to give you any details about what is happening beyond the level of their API. So my gut hunch is that they think that there's something very rare about this happening, but they refuse to give the article writer the information that might or might not help them avoid the bug.
If you pay for the highest level of support you will get extremely good support. But it comes with signing an NDA, so you're not going to read about anything coming out of it on a blog.
I've had AWS engineers confirm very detailed and specific technical implementation details many, many times. But these were at companies that happily spent over $1M/year with AWS.
Nah, if your monthly spend is really significant then you will get good support, and issues you care about will get prioritized. Going from a startup with $50K/month spend to a large company spending untold millions per month, the experience is night and day. We have dev managers and engineers from key AWS teams present in meetings when need be, we get the issues we raise prioritized and added to dev roadmaps, etc.
I was at a company that spent over $90M a year with AWS and we got defensive, limited comms.
Yeah I agree, this seems like a pretty critical feature of the Aurora product itself. We saw similar behavior recently where we had a connection pooler in between, which indicates something wrong with how they propagate DNS changes during the failover. wtf aws
Whenever we have to do any type of AWS Aurora or RDS cluster modification in prod we always have the entire emergency response crew standing by right outside the door.
Their docs are not good and things frequently don't behave how you expect them to.
Oh, well, it’s always DNS!
It sounds like part of the problem was how the application reacted to the reverted fail over. They had to restart their service to get writes to be accepted, implying some sort of broken caching behavior where it kept trying to send queries to the wrong primary.
It's at least possible that this sort of aborted failover happens a fair amount, but if there's no downtime then users just try again and it succeeds, so they never bother complaining to AWS. Unless AWS is specifically monitoring for it, they might be blind to it happening.
Agreed, we've been running multiple aurora clusters in production for years now and have not encountered this issue with failovers.
Same. There’s something missing here.
fwiw we haven't seen issues doing manual failovers for maintenance using the same/similar procedure described in the article. I imagine there is something more nuanced here, and it's hard to draw too many conclusions without a lot more details being provided by AWS
it could be that most people pause writes, because it's going to create errors if you try to execute a write against an instance that refuses to accept writes, and for some people those errors might not be recoverable. so they just have some option in their application that puts it into maintenance mode where it will hard-reject writes at the application layer.
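A hedged sketch of that application-layer pattern (the flag and function names here are illustrative, not from the article):

```python
# Illustrative "pause writes during failover" gate at the application layer.
import threading

WRITES_PAUSED = threading.Event()

class WritesPausedError(RuntimeError):
    """Raised instead of sending a write to an instance mid-failover."""

def execute_write(conn, sql, params=None):
    if WRITES_PAUSED.is_set():
        # Hard-reject here so callers get a clean, retryable error rather than
        # a driver-level failure from a not-yet-writable instance.
        raise WritesPausedError("writes are paused for planned failover")
    with conn.cursor() as cur:
        cur.execute(sql, params)
    conn.commit()

# Operator flow for a planned failover:
#   WRITES_PAUSED.set(); trigger failover; verify new writer; WRITES_PAUSED.clear()
```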
P0 if it happens to everyone, right? Like the USE1 outage recently. If it is 0.001% of customers (enough to get an HN story) it may not be that high. Maybe this customer is on a migration or upgrade path under the hood. Or just on a bad unit in the rack.
I recall seeing this also happen in CosmosDB, with both auto and manual failovers.
The article is low quality. It does not mention which Aurora PostgreSQL version was involved, and it provides no real detail about how the staging environment differed from production, only saying that staging “didn’t reproduce the exact conditions,” which is not actionable.
This AWS documentation section: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQ...
The “Amazon Aurora PostgreSQL updates” page, under “Aurora PostgreSQL 17.5.3, September 16, 2025 – Critical stability enhancements”, includes a potential match:
“...Fixed a race condition where an old writer instance may not step down after a new writer instance is promoted and continues to write…”
If that is the underlying issue, it would be serious, but without more specifics we can’t draw conclusions.
For context: I do not work for AWS, but I do run several production systems on Aurora PostgreSQL. I will try to reproduce this using the latest versions over the next few hours. If I do not post an update within 24 hours, assume my tests did not surface anything.
That would not rule out a real issue in certain edge cases, configurations, or version combinations, but it would at least suggest it is not broadly reproducible.
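For reference, a rough sketch of the kind of harness I have in mind: sustain writes against the cluster writer endpoint while triggering a manual failover through the RDS API. The cluster identifier, endpoint, credentials, and table below are placeholders.

```python
# Sketch: keep a steady write load on the cluster writer endpoint while a
# manual failover is triggered, and watch whether writes recover on their own.
import threading
import time

import boto3
import psycopg2

WRITER_ENDPOINT = "my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com"  # placeholder

def write_loop(stop: threading.Event) -> None:
    while not stop.is_set():
        try:
            conn = psycopg2.connect(host=WRITER_ENDPOINT, dbname="probe",
                                    user="app", password="...", connect_timeout=3)
            conn.autocommit = True
            with conn.cursor() as cur:
                while not stop.is_set():
                    cur.execute("INSERT INTO failover_probe (ts) VALUES (now())")
                    time.sleep(0.01)
        except psycopg2.Error as exc:
            print("write error:", exc)   # brief errors are expected during failover
            time.sleep(0.5)

stop = threading.Event()
threading.Thread(target=write_loop, args=(stop,), daemon=True).start()

boto3.client("rds").failover_db_cluster(DBClusterIdentifier="my-cluster")
time.sleep(180)   # observe: do writes resume without restarting the client?
stop.set()
```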
Although the article has an SEO-optimized vibe, I think it's reasonable to take it as true until refuted. My rule of thumb is that any rarely executed, very tricky operation (e.g. database writer failover) is likely to not work, because there are too many variables in play and way too few opportunities to find and fix bugs. So the overall story sounds very plausible to me. It has a feel of: it doesn't work under continuous heavy write load, in combination with some set of hardware performance parameters that plays badly with some arbitrary timeout.
Note that the system didn't actually fail. It just didn't process the failover operation. It reverted to the original configuration and afaics preserved data.
Wow. This is alarming.
We have done a similar operation routinely on databases under pretty write intensive workloads (like 10s of thousands of inserts per second). It is so routine we have automation to adjust to planned changes in volume and do so a dozen times a month or so. It has been very robust for us. Our apps are designed for it and use AWS’s JDBC wrapper.
Just one more thing to worry about I guess…
Not really: Their storage layer worked perfectly and prevented the ACID violations.
Yikes! This is exactly the kind of invariant I'd expect Aurora to maintain on my behalf. It is why I pay them so much...
It did, the storage layer did not allow for concurrent writes.
Sadly, it's not the first time I have noticed unexpected and odd behaviors from the Aurora PostgreSQL offering.
I noticed another interesting (and still unconfirmed) bug with Aurora PostgreSQL around their Zero Downtime Patching.
During an Aurora minor version upgrade, Aurora preserves sessions across the engine restart, but it appears to also preserve stale per-session execution state (including the internal statement timer). After ZDP, I’ve seen very simple queries (e.g. a single-row lookup via Rails/ActiveRecord) fail with `PG::QueryCanceled: ERROR: canceling statement due to statement timeout` in far less than the configured statement_timeout (GUC), and only in the brief window right after ZDP completes.
My working theory is that when the client reconnects (e.g. via PG::Connection#reset), Aurora routes the new TCP connection back to a preserved session whose “statement start time” wasn’t properly reset, so the new query inherits an old timer and gets canceled almost immediately even though it’s not long-running at all.
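If that theory is right, a blunt client-side guard is to treat a suspiciously fast cancellation right after a reconnect as stale session state and retry once on a truly fresh connection. A hedged sketch, shown in Python/psycopg2 for brevity (the report above is Rails/ActiveRecord):

```python
# Hedged workaround sketch: if a query is cancelled right after a reconnect,
# discard that (possibly preserved) session and retry exactly once.
import psycopg2
from psycopg2 import errors

def query_with_zdp_guard(connect, sql, params=None):
    conn = connect()                     # e.g. lambda: psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    except errors.QueryCanceled:
        # Suspected stale per-session statement timer after ZDP: drop the
        # session and retry once on a brand-new connection.
        conn.close()
        conn = connect()
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
```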
I think OP is wrong in their hypothesis based on the logs they share and the root cause AWS support provided them.
I think the promotion fails to happen and then an external watchdog notices that it didn’t, and kills everything ASAP as it’s a cluster state mismatch.
The message about the storage subsystem going away is after the other Postgres process was kill -9’d.
People who have experience with Aurora and RDS Postgres: what's your experience in terms of performance? If you don't need Multi-AZ and quick failover, can you achieve better performance with RDS and e.g. gp3 at 64,000 IOPS and 3,125 MB/s throughput (assuming everything else can deliver that and CPU/memory isn't the bottleneck)? Aurora seems to be especially slow for inserts and also quite expensive compared to what I get with RDS when I estimate things in the calculator. And what's the story on read performance for Aurora vs RDS? There's an abundance of benchmarks showing Aurora is better in terms of performance, but they leave out so much about their RDS config that I'm having a hard time believing them.
We've seen better results and lower costs in a 1 writer, 1-2 reader setup on Aurora PG 14. The main advantages are 1) you don't re-pay for storage for each instance--you pay for cluster storage instead of per-instance storage, and 2) you no longer need to provision IOPS, and it provides ~80k IOPS.
If you have a PG cluster with 1 writer, 2 readers, 10Ti of storage, and 16k provisioned IOPS (io1/2 has better latency than gp3), you pay for 30Ti and 48k PIOPS without redundancy, or 60Ti and 96k PIOPS with Multi-AZ.
With the same setup on Aurora, you pay for 10Ti and get Multi-AZ for free (assuming the same cluster layout and that you've put the instances in different AZs).
I don't want to figure the exact numbers but iirc if you have enough storage--especially io1/2--you can end up saving money and getting better performance. For smaller amounts of storage, the numbers don't necessarily work out.
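Restating that storage math as a quick back-of-the-envelope calculation (numbers taken straight from the example above; pricing deliberately left out):

```python
# 1 writer + 2 readers, 10 TiB of data, 16k provisioned IOPS per instance.
nodes = 3
data_tib = 10
piops_per_instance = 16_000

rds_storage_single_az = nodes * data_tib             # 30 TiB billed
rds_piops_single_az   = nodes * piops_per_instance   # 48k PIOPS billed
rds_storage_multi_az  = 2 * rds_storage_single_az    # 60 TiB with Multi-AZ standbys
rds_piops_multi_az    = 2 * rds_piops_single_az      # 96k PIOPS

aurora_storage_tib = data_tib   # cluster storage billed once; readers share it
```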
There are also two I/O billing modes to be aware of. There's the default pay-per-I/O, which is really only helpful for extreme spikes and generally low I/O usage. The other mode is "provisioned" or "storage optimized" or something, where you pay a flat 30% of the instance cost (in addition to the instance cost) for unlimited I/O--you can get a lot more I/O and end up cheaper in this mode if you had an I/O-heavy workload before.
I'd also say Serverless is almost never worth it. Iirc provisioning instances was ~17% of the cost of serverless. Serverless only works out if you have roughly <4 hours of heavy usage followed by almost all idle. You can add instances fairly quickly and failover for minimal downtime (of course barring running into the bug the article describes...) to handle workload spikes using fixed instance sizes without serverless.
Have you benchmarked your load on RDS? [0] says that IOPS on Aurora is vastly different from actual IOPS. We have just one writer instance and mostly write hundreds of GB in bulk.
[0] https://dev.to/aws-heroes/100k-write-iops-in-aurora-t3medium...
For me, the big miss with Postgres Aurora RDS was costs. We had some queries that did a fair amount of I/O in a way that would not normally be a problem, but in the Aurora Postgres RDS world that I/O was crazy expensive. A couple of fuzzy queries blew costs up to over $3,000/month for a database that should have cost maybe $50-$100/month. And this was for a dataset of only about 15 million rows without anything crazy in them.
Sounds like you need to use IO optimized storage billing mode.
We were burned by Aurora. Costs, performance, latency, all were poor and affected our product. Having good systems admins on staff, we ended up moving PostgreSQL on-prem.
My experience is with Aurora MySQL, not postgres. But my understanding is that the way the storage layer works is much the same.
We have some clusters with very high write IOPS on Aurora.
When looking at costs we modelled running MySQL and regular RDS MySQL.
We found for the IOPS capacity of Aurora we wouldn't be able to match it on AWS without paying a stupid amount more.
Aurora doesn't use EBS under the hood. It has no option to choose storage type or I/O latency, only a billing choice between pay-per-I/O and fixed-price I/O.
Precisely! That's why RDS sounds so interesting. I get a lot more knobs to tweak performance, but I'm curious if a maxed out gp3 with instances that support it is going to fare any better than Aurora.
I've had better results managing my own clusters on metal instances. You get much better performance with e.g. NVMe drives in a RAID 0+1 (around a million IOPS in a pure RAID 0 with 7 drives), and I am comfortable running my own instances and clusters. I don't care for the way RDS limits your options on extensions and configuration, and I haven't had a good time with the high-availability failovers internally; I'd rather run my own 3 instances in a cluster, and 3 clusters in different AZs.
Blatant plug time:
I'm actually working for a company right now ( https://pgdog.dev/ ) that is working on proper sharding and failovers from a connection pooler standpoint. We handle failovers like this by pausing write traffic for up to 60 seconds by default at the connection pooler and swapping which backend instance is getting traffic.
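The general shape of that pooler-side behavior is something like the sketch below (a simplification with hypothetical names, not pgdog's actual implementation):

```python
# Simplified pooler-side gate: hold writes while the primary is swapped,
# release them once the new backend is in place or the pause budget expires.
import threading

class FailoverGate:
    def __init__(self, max_pause_seconds: float = 60.0):
        self.max_pause_seconds = max_pause_seconds
        self._writes_allowed = threading.Event()
        self._writes_allowed.set()

    def begin_failover(self):
        self._writes_allowed.clear()      # new writes start queueing here

    def finish_failover(self):
        self._writes_allowed.set()        # queued writes drain to the new primary

    def wait_for_write_slot(self):
        # Called per write; blocks until failover completes or the budget runs out.
        if not self._writes_allowed.wait(timeout=self.max_pause_seconds):
            raise TimeoutError("failover exceeded the write-pause budget")

# Orchestration (hypothetical helpers):
#   gate.begin_failover(); swap_backend_to_new_primary(); gate.finish_failover()
```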
> 3125 throughput
Max throughput on gp3 was recently increased to 2 GB/s; is there some way I don't know about to get 3.125 GB/s?
This is super confusing. Check out the RDS Postgres calculator with gp3:
> General Purpose SSD (gp3) - Throughput > gp3 supports a max of 4000 MiBps per volume
But the docs say 2,000. Then there's IOPS... The calculator allows up to 64,000, but on [0], if you expand "Higher performance and throughput" it says
> Customers looking for higher performance can scale up to 80,000 IOPS and 2,000 MiBps for an additional fee.
[0] https://aws.amazon.com/ebs/general-purpose/
RDS PG stripes multiple gp3 volumes, so that's why RDS throughput is higher than a single gp3 volume.
I think 80k IOPS on gp3 is a newer release, so presumably AWS hasn't updated RDS from the old max of 64k. iirc it took a while before gp3 and io2 were even available for RDS after they were released as EBS options
Edit: Presumably it takes some time to do testing/optimizations to make sure their RDS config can achieve the same performance as EBS. Sometimes there are limitations with instance generations/types that also impact whether you can hit maximum advertised throughput
Only if you allocate (and pay for) more than 400GB. And if you have high traffic 24/7 beware of "EBS optimized" instances which will fall down to baseline rates after a certain time. I use vantage.sh/rds (not affiliated) to get an overview of the tons of instance details stretched out over several tables in AWS docs.
RDS stripes multiple gp3 volumes. Docs are saying 4 GiB/s per instance is the max for gp3, if I'm looking at the right table.
> There's an abundance of benchmarks showing Aurora is better in terms of performance but they leave out so much about their RDS config that I'm having a hard time believing them.
Do you have a problem believing these claims on equivalent hardware? https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
Or do your own performance assessments, following the published document and templates available so you can find the facts on your own?
For Aurora MySql:
"Amazon Aurora Performance Assessment Technical Guide" - https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...
For Aurora Postgres:
"...Steps to benchmark the performance of the PostgreSQL-compatible edition of Amazon Aurora using the pgbench and sysbench benchmarking tools..." - https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...
"Automate benchmark tests for Amazon Aurora PostgreSQL" - https://aws.amazon.com/blogs/database/automate-benchmark-tes...
"Benchmarking Amazon Aurora Limitless with pgbench" - https://aws.amazon.com/blogs/database/benchmarking-amazon-au...
This confirms a lot of what their engineers preach: The lego brick model.
They made the storage layer in total isolation, and they made sure that it guaranteed correctness for exclusive writer access. When the upstream service failed to also make its own guarantees, the data layer was still protected.
Good job AWS engineering!
> AWS has indicated a fix is on their roadmap, but as of now, the recommended mitigation aligns with our solution: use Aurora’s Failover feature on an as-needed basis and ensure that no writes are executed against the DB during the failover.
Is there a case number where we can reach out to AWS regarding this recommendation?
Yeah. I'd like this too.
We use Aurora MySQL but I would like to be able to point to that and ask if it applies to us.
Glad to know I’m not crazy.
AWS Support initially pushed back and suggested it's because of high replication lag, but they were looking at metrics that were more than 24 hours old. What kind of failure did you encounter? I really want to understand what edge case we triggered in their failover process, especially since we could not reproduce it in other regions.
My cluster recently started failing over every few days, whenever it experiences enough load to trigger a scale-up from 1-2 to 20+ ACUs.
And then I also encountered errors just like OP in my app layer, about trying to execute a write query in a read-only transaction.
The workaround so far is to invalidate the connection on error. When the app reconnects, the cluster writer endpoint correctly resolves to the current primary.
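In psycopg2 terms, that workaround looks roughly like this (the pool checkout/invalidate hooks are placeholders):

```python
# Sketch of "invalidate the connection on error": if a write lands on an
# instance that is no longer the writer, drop that connection so the next
# checkout re-resolves the cluster writer endpoint.
import psycopg2
from psycopg2 import errors

def execute_write(checkout, invalidate, sql, params=None):
    conn = checkout()                    # placeholder: borrow from your pool
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
        conn.commit()
    except errors.ReadOnlySqlTransaction:
        # "cannot execute ... in a read-only transaction": stale writer.
        invalidate(conn)                 # placeholder: evict from the pool
        raise
```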
> Aurora's architecture differs from traditional PostgreSQL in a crucial way: it separates compute from storage.
I find this approach very compelling. MSSQL has a similar thing with their hyperscale offering. It's probably the only service in Azure that I would actually use.
A good reminder that building a mental model of read replicas as the way to scale is a slippery slope. At the end of the day you're scaling only one specific part of your system, with certain consistency dynamics that are difficult to reason about.
Works fine for workloads like:
1. I need to grab some rows from a table
2. Eventual consistency is good enough
And that's a lot of workloads.
As a user, I've come to realize the situations where I think eventual consistency (or delayed processing) are good enough aren't the same as the folks developing most products. Nothing annoys me more than stuff not showing up immediately or having to manually refresh.
Sometimes users want everything to show up immediately, but not pay extra for the feature. Everything real time is expensive. Eventual consistency is a good thing for most systems.
For a workload where you need true read after write you can just send those reads to the writer. But even if you don't there are plenty of workarounds here. You can send a success response to the user when the transaction commits to the writer and update the UI on response. The only case where this will fail is if the user manually reloads the page within the replication lag window and the request goes to the reader. This should be exceedingly rare in a single region cluster, and maybe a little less rare in a multi-region set up, but still pretty rare. I almost never see > 1s replication lag between regions in my Aurora clusters. There are certainly DB workloads where this will not be true, but if you are in a high replication lag cluster, you just don't want to use that for this type of UI dependency in the first place.
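A minimal sketch of that routing rule (the pool objects and the needs_own_writes flag are illustrative, not a specific library's API):

```python
# Route reads that must observe the caller's own recent write to the writer
# endpoint; everything else can go to a reader replica.
def run_read(sql, params, *, needs_own_writes, writer_pool, reader_pool):
    pool = writer_pool if needs_own_writes else reader_pool
    with pool.connection() as conn:        # placeholder pool API
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()

# Typical flow: commit the write on the writer, then render the confirmation
# from the write response itself instead of re-reading through a replica.
```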
I think the key here is just proper notifications. Yes it's eventually consistent, but having a "processing" or "update in progress" is a huge improvement over showing a user old data.
That's readonly. RW workloads usually don't tolerate eventual consistency on the thing they're writing.
Yeah, if you have a mix of reads and writes in a workflow, you gotta hit the writer node. But a lot of times an endpoint is only reading data from a particular DB.
Future you, or a future team member, may struggle to reason about that.
You can hit the same problems horizontally scaling compute. One instance reads from the DB, a request hits a different instance which updates the DB. The original instance writes to the DB and overwrites the changes or makes decisions based on stale data.
More broadly a distributed system problem
probably should have added postgres to end of title
Absolutely this. The differences between Aurora Postgres and Aurora MySQL are quite significant. A failover bug affecting one doesn't imply the same bug exists in the other.
A lot of people seem to have the misconception that "Aurora" is its own unique database system, with different front-ends "pretending" to be Postgres or MySQL, but that isn't the case at all.
Am I the only one who misread that as “AI race condition”?