Yi 34B Chat has not done well on my new NYT Connections benchmark, and it's only in 22nd place on the LMSYS Elo-based leaderboard (151 Elo below GPT-4 Turbo). It does better in Chinese. Among models with open weights, Qwen 72B is clearly stronger.
Ooh, I also use Connections as a benchmark! It tends to favour models with 'chain of thought' style reasoning somewhere in the training mix, since directly producing the answer is hard. Do you have public code you could share?