
Yi 34B Chat has not done well on my new NYT Connections benchmark, and it's only in 22nd place on the LMSYS Elo-based leaderboard (151 Elo below GPT-4 Turbo). It does better in Chinese. Among models with open weights, Qwen 72B is clearly stronger.


Ooh, I also use Connections as a benchmark! It tends to favour models with 'chain of thought' style reasoning somewhere in the training mix, since directly producing the answer is hard. Do you have public code you could share?


I'd also love to see how you're doing this benchmarking!



