Enabling audio makes it easy to find a target based on their inherit characteristics. The bar is low to find, for example, a woman to harass in a videogame over voice comms (an exceedingly common problem).
FB, Reddit, et.al. require the target to actually exhibit a belief or opinion to become a target. It ironically raises the bar for being an asshole to a specific target when you can't hear them.
I do not think the solution to hate speech and discrimination is to hide the traits that make us different.
If the future of the metaverse is on the blockchain, each digital identity will have a reputation to uphold. I think transparency in people's actions, a reputation system, and a simple moderation system is sufficient to manage toxic behavior.
I am not sure if the issue stems from enabling audio. We see this over text on facebook, reddit, and pretty much every gaming platform.