For decades, the diagnostic interview has served as the bedrock of psychiatry. Whether a patient is seeking help for depression, anxiety, or bipolar disorder, the process is largely the same: a clinician sits down with a script, asks a series of standardized questions, and arrives at a diagnosis based on the answers. It is the field’s "gold standard."
But a new review published in JAMA Network Open suggests that this standard is far from definitive. By analyzing "test-retest reliability"—a measure of whether the same patient receives the same diagnosis when interviewed twice—researchers found that the consistency of these tools varies wildly depending on the condition being assessed. In many cases, the diagnostic process is less of a precise measurement and more of a moving target.
Why the Numbers Don't Always Add Up
Laura Duncan, a psychiatry professor at McMaster University and a lead author of the study, argues that the field has long relied on these interviews simply because there is no better alternative. "They are often treated as a ‘gold standard’ for assessing mental disorders," Duncan said, "but they fall short of providing a definitive benchmark that demonstrates excellent validity and reliability."
To measure this, the research team used Cohen’s kappa coefficient, a statistic that scores agreement between two sets of diagnoses after subtracting the agreement expected by pure chance. The results revealed a clear hierarchy of reliability. Substance use disorders, particularly opioid use disorder, showed the highest consistency.
Duncan attributes this to the nature of the criteria. It is objectively easier for a patient to report the number of drinks consumed in a week than it is to quantify the subjective, fluctuating intensity of anxiety or depressive episodes. When the criteria are rooted in observable behavior, the diagnosis tends to be more stable. When they are rooted in internal emotional states, the reliability drops.
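For readers curious how Cohen’s kappa works in practice, the sketch below computes it for a test-retest scenario. It is a minimal illustration, not the review’s actual analysis: the function name `cohens_kappa` and the ten patient labels are invented for this example. The idea is to compare the share of patients who received the same diagnosis twice against the share that would match by chance, given how often each label was assigned in each interview.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(first)

    # Observed agreement: share of patients given the same label both times.
    p_observed = sum(a == b for a, b in zip(first, second)) / n

    # Chance agreement: for each label, multiply how often it was assigned
    # in the first interview by how often in the second, then sum.
    counts_first = Counter(first)
    counts_second = Counter(second)
    p_chance = sum(
        counts_first[label] * counts_second[label] for label in counts_first
    ) / (n * n)

    if p_chance == 1.0:
        # Every rating is the same single label; kappa is undefined here.
        raise ValueError("kappa is undefined when only one label is ever used")

    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical data: the same 10 patients interviewed twice.
interview_1 = ["present"] * 4 + ["absent"] * 6
interview_2 = ["present"] * 3 + ["absent"] * 6 + ["present"]

print(round(cohens_kappa(interview_1, interview_2), 2))  # 0.58
```

In this made-up case the two interviews agree on 8 of 10 patients, yet kappa is only about 0.58, because roughly half of that agreement (0.52) would be expected by chance alone. A kappa of 1 means perfect agreement and 0 means no better than chance.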
The Tension Between Structure and Nuance
Not everyone in the field is satisfied with the study’s broad conclusions. Dr. Michael First, a professor at Columbia University and a primary author of the Structured Clinical Interview for DSM-5 (SCID), argues that the study fails to distinguish between the different types of tools currently in use.
"It’d be nice to be able to look at this and say: ‘Oh, based upon this paper, I should pick this one because of this,’" First said. "But there’s simply not enough information here."
First points to a fundamental trade-off in psychiatric assessment: the difference between "fully structured" and "semi-structured" interviews.
- Fully structured interviews are rigid. They are designed for large-scale epidemiological research where the interviewer may have minimal clinical training. Because the interviewer cannot deviate from the script, these tools are highly consistent—but they lack the ability to probe contradictory or vague patient responses.
- Semi-structured interviews are designed for trained clinicians. They allow the provider to "ad-lib" follow-up questions to clarify a patient's experience. While this leads to a more nuanced clinical picture, it also introduces more variability, as the flow of the conversation changes from one session to the next.
What Experts Say
While the study highlights a lack of rigor in how diagnostic tools are reported and compared, it also underscores a 50-year-old frustration in psychiatry: the lack of objective, biological markers for mental illness.
"We’ve been saying that for 50 years," First noted, referring to the hope that laboratory tests might one day replace or supplement the interview process. Until then, the field remains stuck in a cycle of relying on tools that are known to be imperfect. Duncan acknowledges that the data required to perform a granular, tool-by-tool comparison simply does not exist yet, as many published papers fail to clearly report the specific format or structure of the interview used.
Key Takeaways
- Diagnostic interviews, while widely used as the "gold standard," show inconsistent reliability across different mental health conditions.
- Conditions based on observable behaviors, such as substance use disorders, currently yield more reliable diagnostic results than those based on internal emotional states.
- There is a significant gap in the research regarding how different interview formats—fully structured versus semi-structured—impact diagnostic accuracy and consistency.
The Path Forward
The next step for the field is not necessarily to abandon the interview, but to demand better data on the tools themselves. The lack of transparency in how these interviews are conducted in research settings is a systemic issue that prevents clinicians from choosing the most reliable instruments for their patients. As researchers prepare for the next wave of clinical trials in 2026, the focus must shift toward standardizing how these tools are reported. Until then, clinicians and patients alike should view the diagnostic interview as a starting point for a conversation, rather than a final, immutable verdict.
This article is for informational purposes only. Always consult a qualified healthcare professional before making any medical decisions.