How to differentiate between an auto-answer IVR and real human audio

In my use case, no live human is involved at first. A detected ‘human’ is asked to answer a simple question. The simplistic version is “Are you a human I can talk to?”, but I prefer a second-level question: “Hi there, what’s your name?” STT checks the answer in a couple of seconds or less. A ‘yes’ to the simplistic question moves the call on, while ‘Joanna’ or ‘Justin’ also gives you some context to carry forward, if you want to come across as ‘human’ too.

(Passive-aggressive first response before finding a live agent: “Please hold, Justin. You made me wait 22 minutes before you answered. I will be back with you in 23 seconds or so, OK?”)
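Here is a rough Python sketch of the check described above, in case it helps. It assumes your STT engine hands back a plain-text transcript, or nothing at all if no answer arrives inside the couple-of-seconds window. The names `classify_reply`, `hold_message`, `Verdict`, and the word lists are all made up for illustration, not part of any real STT or telephony API, and the lists would need tuning against real calls.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical word lists; tune these against real call recordings.
YES_WORDS = {"yes", "yeah", "yep", "sure", "speaking"}
IVR_MARKERS = ("press", "your call", "please hold", "menu", "after the tone")
FILLER = {"my", "name", "is", "it's", "its", "this", "i'm", "im", "hi", "hello"}


@dataclass
class Verdict:
    is_human: bool
    name: Optional[str] = None


def classify_reply(transcript: Optional[str], asked_for_name: bool) -> Verdict:
    """Decide whether an STT transcript looks like a live human's answer.

    Pass transcript=None when STT returned nothing within the time limit.
    """
    if not transcript or not transcript.strip():
        return Verdict(False)  # silence or timeout: treat as not human

    text = transcript.lower()
    if any(marker in text for marker in IVR_MARKERS):
        return Verdict(False)  # sounds like a recorded menu or hold message

    words = re.findall(r"[a-z']+", text)
    if not asked_for_name:
        # Simplistic question: "Are you a human I can talk to?"
        return Verdict(any(w in YES_WORDS for w in words))

    # Second-level question: "Hi there, what's your name?"
    candidates = [w for w in words if w not in FILLER]
    if 1 <= len(candidates) <= 2:  # a short reply is assumed to be a name
        return Verdict(True, " ".join(w.capitalize() for w in candidates))
    return Verdict(False)


def hold_message(name: str, waited_minutes: int, eta_seconds: int) -> str:
    """Build the hold prompt using the name captured from the reply."""
    return (
        f"Please hold, {name}. You made me wait {waited_minutes} minutes "
        f"before you answered. I will be back with you in {eta_seconds} "
        "seconds or so, OK?"
    )
```

For example, `classify_reply("It's Justin.", asked_for_name=True)` gives a human verdict with the name `Justin`, which you can feed straight into `hold_message`. A long or menu-like reply to the name question is treated as not human, which is the main reason I prefer the second-level question: an IVR is unlikely to answer it with one or two words.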

IWFM