Warning: please don’t blindly follow the steps here without doing your own analysis of the risks involved, and ideally not without getting Oracle Support involved. I was hesitant to publish this, but I’ve been in contact with someone else and it has helped them work around (to some extent) an issue they were having, so I think it’s worth putting out there.
We had a problem during a Data Guard switchover (luckily a planned switchover for patching rather than a disaster situation) where Grid Infrastructure (clusterware) was unable to bring up one of the databases; it kept throwing “ORA-01017: invalid username/password”. Starting the database the ‘traditional way’ using “sqlplus / as sysdba” caused no such problem.
Reviewing Oracle Support, particularly Doc ID 2313555.1, we identified some non-standard configuration in the Oracle home used for this database, but even after we resolved it, the error persisted.
At times like these you realize (or at least I did) how little is published about the internals of how clusterware and the Oracle databases it manages interact.
I suspected that restarting the entire clusterware stack would resolve the issue, but that was difficult because this node also managed a production database which we didn’t want to take down.
However, I guessed that restarting the clusterware agent for the oracle user might fix the problem. The executable is oraagent.bin and the process owner is oracle. I believe this is the process clusterware uses to actually start the database. (You’ll probably also notice a similar oraagent.bin process owned by grid, and orarootagent.bin running as root.)
I killed the oracle-owned agent process and crossed my fingers. Luckily clusterware re-spawned the process, and afterwards we were able to restart the problem instance without any errors.
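For anyone wanting to see which process I mean before touching anything, here is a rough Python sketch of how you might pick out the oracle-owned agent from a process listing. It is only an illustration: the function names are my own invention, it assumes a Linux-style `ps` that accepts `-eo pid=,user=,args=`, and it deliberately does not kill anything. Running it before and after should show a different PID if clusterware has re-spawned the agent.

```python
import subprocess


def find_agent_pids(ps_output, owner="oracle", exe="oraagent.bin"):
    """Return PIDs of processes owned by `owner` whose executable is `exe`.

    `ps_output` is expected to hold one process per line in the form
    "<pid> <user> <command line>" with no header row.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, user, full command line
        if len(parts) < 3:
            continue  # blank or malformed line
        pid, user, args = parts
        command = args.split()[0]
        # Compare only the file name, so the Grid home path does not matter
        # and orarootagent.bin is not mistaken for oraagent.bin.
        if user == owner and command.rsplit("/", 1)[-1] == exe:
            pids.append(int(pid))
    return pids


def list_processes():
    """Return the output of ps (assumption: Linux procps syntax)."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,user=,args="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the affected node, `find_agent_pids(list_processes())` should return the PID of the agent running as oracle, while the grid-owned agent and orarootagent.bin are left out of the result.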
Please re-read the first paragraph if you are considering applying this workaround, and don’t blame me if you break anything. If it helps, though, I’m happy to take the credit!