U-MATH / μ-MATH leaderboard
These datasets are designed to test the mathematical reasoning and meta-evaluation capabilities of Large Language Models (LLMs) on university-level problems. U-MATH provides a set of 1,100 university-level mathematical problems, while μ-MATH complements it with a meta-evaluation framework focusing on solution judgment with 1,084 LLM solutions.
U-MATH Leaderboard

Accuracy (%) on the 1,100 U-MATH problems, overall and on the text-only and visual subsets; free-form answers are graded by an LLM judge (gpt-4o-2024-08-06).
| # | Type | Model | U-MATH Acc. | Text Acc. | Visual Acc. |
|---|------|-------|-------------|-----------|-------------|
| 1 | Proprietary | o1 | 86.8 | 93.1 | 58.5 |
| 2 | Proprietary | gemini-2.0-flash-thinking-exp-01-21 | 83.6 | 89.2 | 58.5 |
| 3 | Proprietary | o3-mini | 82.2 | 92.8 | 34.5 |
| 4 | Open-Weights | deepseek-ai/DeepSeek-R1 | 80.7 | 91.3 | 33 |
| 5 | Proprietary | o1-mini | 76.3 | 82.9 | 46.5 |
| 6 | Open-Weights | Qwen/QwQ-32B-Preview | 73.1 | 82.7 | 30 |
| 7 | Open-Weights | — | 65 | 69.7 | 44 |
| 8 | Open-Weights | — | 62.6 | 69.3 | 32.5 |
| 9 | Proprietary | gemini-1.5-pro | 60.1 | 63.4 | 45 |
| 10 | Open-Weights | Qwen/Qwen2.5-Math-72B-Instruct | 59.5 | 68.7 | 18 |
| 11 | Proprietary | gemini-1.5-flash | 57.8 | 61.2 | 42.5 |
| 12 | Open-Weights | — | 54.9 | 62.9 | 19 |
| 13 | Open-Weights | — | 54.5 | 58.3 | 37 |
| 14 | Open-Weights | — | 52.4 | 60.4 | 16 |
| 15 | Open-Weights | Qwen/Qwen2.5-72B-Instruct | 51.2 | 58.9 | 16.5 |
| 16 | Proprietary | gpt-4o-2024-08-06 | 50.2 | 53.9 | 33.5 |
| 17 | Open-Weights | — | 47.8 | 51.4 | 31.5 |
| 18 | Open-Weights | mistralai/Mistral-Large-Instruct-2411 | 47.6 | 55.6 | 12 |
| 19 | Open-Weights | Qwen/Qwen2.5-Math-7B-Instruct | 45.5 | 53 | 11.5 |
| 20 | Open-Weights | — | 44.7 | 51.7 | 13.5 |
| 21 | Proprietary | gpt-4o-mini-2024-07-18 | 43.4 | 47.2 | 26 |
| 22 | Open-Weights | Qwen/Qwen2.5-7B-Instruct | 43.3 | 50.4 | 11 |
| 23 | Open-Weights | — | 42.5 | 47.7 | 19.5 |
| 24 | Open-Weights | — | 41.8 | 43.9 | 32.5 |
| 25 | Proprietary | claude-sonnet-3.5 | 38.7 | 40.7 | 30 |
| 26 | Open-Weights | — | 37.2 | 41.8 | 16.5 |
| 27 | Open-Weights | — | 34.8 | 39.9 | 12 |
| 28 | Open-Weights | meta-llama/Llama-3.1-70B-Instruct | 34.3 | 39.6 | 10.5 |
| 29 | Open-Weights | meta-llama/Llama-3.1-8B-Instruct | 29.5 | 33.7 | 11 |
| 30 | Open-Weights | — | 26.3 | 27.1 | 22.5 |
| 31 | Proprietary | LFM-7B | 25.8 | 28 | 16 |
| 32 | Open-Weights | mistralai/Ministral-8B-Instruct-2410 | 23.1 | 26.9 | 6 |
| 33 | Open-Weights | — | 20.4 | 22.9 | 9 |
| 34 | Open-Weights | — | 17.5 | 17.9 | 16 |
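As a rough consistency check, the overall score tracks a size-weighted mix of the two subsets, assuming the roughly 80% text / 20% visual composition reported for U-MATH; a minimal sketch in Python using the gemini-1.5-pro row above (the 80/20 split is an assumption here, and small deviations from rounding are expected):

```python
# Rough consistency check: overall accuracy ≈ size-weighted mix of the text/visual subsets.
# Assumes U-MATH is ~80% text-only and ~20% visual problems (approximate split).
text_acc, visual_acc = 63.4, 45.0   # gemini-1.5-pro row from the table above
overall_est = 0.8 * text_acc + 0.2 * visual_acc
print(f"estimated overall accuracy: {overall_est:.1f} (reported: 60.1)")  # ~59.7
```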
μ-MATH Judge Leaderboard

Meta-evaluation of each model as a solution judge on the 1,084 μ-MATH cases (F1 on correct/incorrect verdicts, %).
| # | Type | Model | Size (B) | F1 |
|---|------|-------|----------|----|
| 1 | Proprietary | o1 | — | 89.5 |
| 2 | Proprietary | o1-mini | — | 84.8 |
| 3 | Open-Weights | Qwen/QwQ-32B-Preview | 32.8 | 83.2 |
| 4 | Open-Weights | deepseek-ai/DeepSeek-R1 | 684.5 | 82.2 |
| 5 | Proprietary | gemini-2.0-flash-thinking-exp-01-21 | — | 81 |
| 6 | Proprietary | google/gemini-1.5-pro | — | 80.7 |
| 7 | Proprietary | gpt-4o-2024-08-06 | — | 77.4 |
| 8 | Open-Weights | mistralai/Mistral-Large-Instruct-2411 | 122.6 | 76.6 |
| 9 | Open-Weights | Qwen/Qwen2.5-72B-Instruct | 72.7 | 75.6 |
| 10 | Proprietary | google/gemini-1.5-flash | — | 74.8 |
| 11 | Proprietary | claude-sonnet-3-5 | — | 74.8 |
| 12 | Open-Weights | Qwen/Qwen2.5-Math-72B-Instruct | 72.7 | 74 |
| 13 | Proprietary | gpt-4o-mini-2024-07-18 | — | 72.3 |
| 14 | Open-Weights | Qwen/Qwen2.5-7B-Instruct | 7.6 | 69.3 |
| 15 | Open-Weights | Qwen/Qwen2.5-Math-7B-Instruct | 7.6 | 61.9 |
| 16 | Open-Weights | meta-llama/Llama-3.1-70B-Instruct | 70.6 | 61 |
| 17 | Open-Weights | mistralai/Ministral-8B-Instruct-2410 | 8 | 60.5 |
| 18 | Open-Weights | meta-llama/Llama-3.1-8B-Instruct | 8 | 52 |
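The judge scores measure classification quality over binary correct/incorrect verdicts on the μ-MATH solutions; in the original μ-MATH setup this is reported as a macro-averaged F1. A minimal sketch of computing such a score with scikit-learn, using made-up placeholder verdicts rather than actual benchmark outputs:

```python
# Macro-averaged F1 for a solution judge: each item is a binary verdict on
# whether an LLM-generated solution is correct, compared against the gold label.
from sklearn.metrics import f1_score

gold_labels    = [1, 0, 1, 1, 0, 0, 1, 0]  # placeholder ground-truth correctness
judge_verdicts = [1, 0, 1, 0, 0, 1, 1, 0]  # placeholder judge decisions

macro_f1 = f1_score(gold_labels, judge_verdicts, average="macro")
print(f"macro F1: {macro_f1:.3f}")
```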
This repository contains the official leaderboard code for the U-MATH and $\mu$-MATH benchmarks.
Overview
- 📊 U-MATH benchmark at Hugging Face
- 🔎 μ-MATH benchmark at Hugging Face
- 🗞️ Paper
- 👾 Evaluation Code at GitHub
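For local experiments, both datasets can be pulled from the Hugging Face Hub with the `datasets` library; a minimal sketch, where the `toloka/u-math` and `toloka/mu-math` repository IDs and the `test` split name are assumptions to be checked against the dataset cards linked above:

```python
# Minimal example of loading the benchmarks from the Hugging Face Hub.
# Repo IDs and split names are assumptions; verify them on the dataset cards.
from datasets import load_dataset

u_math = load_dataset("toloka/u-math", split="test")    # 1,100 university-level problems
mu_math = load_dataset("toloka/mu-math", split="test")  # 1,084 judged LLM solutions

print(u_math[0].keys())          # inspect the available fields
print(len(u_math), len(mu_math))
```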
Licensing Information
- The contents of μ-MATH's machine-generated `model_output` column are subject to the underlying LLMs' licensing terms.
- The contents of all other U-MATH and μ-MATH dataset fields, as well as the code, are available under the MIT license.