U-MATH / μ-MATH leaderboard
These datasets are designed to test the mathematical reasoning and meta-evaluation capabilities of Large Language Models (LLMs) on university-level problems. U-MATH provides a set of 1,100 university-level mathematical problems, while μ-MATH complements it with a meta-evaluation framework for solution judgment, built from 1,084 LLM solutions.
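Both datasets are published on the Hugging Face Hub (see the links in the Overview below). As a quick-start sketch, they can be pulled with the `datasets` library; the repository IDs and split name here are assumptions inferred from the project's Hugging Face links, so adjust them if they differ:

```python
# Quick-start sketch: load both benchmarks with the Hugging Face `datasets` library.
# NOTE: the repo IDs ("toloka/u-math", "toloka/mu-math") and the "test" split are
# assumptions based on the project's Hugging Face links; adjust if they differ.
from datasets import load_dataset

u_math = load_dataset("toloka/u-math", split="test")    # 1,100 university-level problems
mu_math = load_dataset("toloka/mu-math", split="test")  # 1,084 judged LLM solutions

print(u_math[0])   # inspect a single problem record
print(mu_math[0])  # inspect a single solution-judgment record
```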
U-MATH leaderboard
Correctness of each model's free-form solutions is graded by an LLM judge; all rows below use gpt-4o-2024-08-06 as the judge model. The Text and Visual columns split accuracy over text-only and image-based problems.

Rank | Type | Model Name | U-MATH Acc | U-MATH Text Acc | U-MATH Visual Acc
---|---|---|---|---|---
1 | Proprietary | gemini-2.0-flash-thinking-exp-01-21 | 73.6 | 77.8 | 55
2 | Proprietary | o1 | 70.7 | 77.2 | 45.2
3 | Proprietary | o3-mini | 68.9 | 79.7 | 20.5
4 | Open-Weights | deepseek-ai/DeepSeek-R1 | 68.4 | 79 | 20.5
5 | Proprietary | o1-mini | 63.5 | 73.3 | 19
6 | Open-Weights | Qwen/QwQ-32B-Preview | 61.5 | 71.8 | 15
7 | Proprietary | google/gemini-1.5-pro | 60.1 | 63.4 | 45
8 | Open-Weights | | 55.1 | 59.3 | 36
9 | Open-Weights | | 51.9 | 60.4 | 13.5
10 | Proprietary | google/gemini-1.5-flash | 51.3 | 53.8 | 40
11 | Open-Weights | Qwen/Qwen2.5-Math-72B-Instruct | 50.2 | 59 | 10.5
12 | Open-Weights | | 46.2 | 54.6 | 8.5
13 | Open-Weights | | 44.1 | 51.3 | 11.5
14 | Open-Weights | | 43.8 | 51.4 | 9.5
15 | Proprietary | gpt-4o-2024-08-06 | 43.5 | 46.4 | 30
16 | Proprietary | gpt-4o-2024-05-13 | 43.4 | 45.8 | 32.5
17 | Open-Weights | Qwen/Qwen2.5-72B-Instruct | 41 | 48.6 | 7
18 | Open-Weights | mistralai/Mistral-Large-Instruct-2411 | 40.4 | 48.1 | 5.5
19 | Open-Weights | | 39.7 | 42.9 | 25.5
20 | Open-Weights | Qwen/Qwen2.5-Math-7B-Instruct | 38.4 | 45.2 | 7.5
21 | Open-Weights | | 37.3 | 43.4 | 9.5
22 | Proprietary | gpt-4o-mini-2024-07-18 | 37.2 | 40.3 | 23
23 | Proprietary | claude-sonnet-3-5 | 35.1 | 36.1 | 30.5
24 | Open-Weights | Qwen/Qwen2.5-7B-Instruct | 33.8 | 40 | 6
25 | Open-Weights | | 32.6 | 36.3 | 16
26 | Open-Weights | | 31.4 | 37.4 | 4
27 | Open-Weights | | 31.2 | 32.2 | 26.5
28 | Open-Weights | | 29.5 | 35 | 4.5
29 | Open-Weights | meta-llama/Llama-3.1-70B-Instruct | 28.5 | 33.7 | 5
30 | Open-Weights | | 25.9 | 30.2 | 6.5
31 | Open-Weights | | 25 | 29.7 | 4
32 | Open-Weights | meta-llama/Llama-3.1-8B-Instruct | 22.3 | 26.1 | 5
33 | Proprietary | liquid/lfm-7b | 21.1 | 24.6 | 5.5
34 | Open-Weights | | 20.4 | 21.4 | 15.5
35 | Open-Weights | | 19.2 | 22.8 | 3
36 | Open-Weights | mistralai/Ministral-8B-Instruct-2410 | 18.3 | 21.4 | 4
37 | Open-Weights | | 18 | 20.7 | 6
38 | Open-Weights | | 17.7 | 20.7 | 4.5
39 | Open-Weights | | 17 | 18.6 | 10
40 | Open-Weights | | 15.5 | 15.6 | 15.5
41 | Open-Weights | | 3.3 | 3.7 | 1.5
μ-MATH leaderboard
μ-MATH scores each model as a judge: the model must label LLM-generated solutions as correct or incorrect, and is measured with F1 plus true-positive/true-negative rates (TPR, TNR) and positive/negative predictive values (PPV, NPV). The four subset column groups correspond to the models whose solutions are being judged. All rows below use Qwen/Qwen2.5-72B-Instruct as the answer-extraction model.

Rank | Model Name | Type | #Params (B) | Architecture | Family | Model URL | μ-MATH F1 | μ-MATH TPR | μ-MATH TNR | μ-MATH PPV | μ-MATH NPV | GPT-4o Subset F1 | GPT-4o Subset TPR | GPT-4o Subset TNR | GPT-4o Subset PPV | GPT-4o Subset NPV | Gemini-1.5-Pro Subset F1 | Gemini-1.5-Pro Subset TPR | Gemini-1.5-Pro Subset TNR | Gemini-1.5-Pro Subset PPV | Gemini-1.5-Pro Subset NPV | Llama-3.1-70B Subset F1 | Llama-3.1-70B Subset TPR | Llama-3.1-70B Subset TNR | Llama-3.1-70B Subset PPV | Llama-3.1-70B Subset NPV | Qwen2.5-72B Subset F1 | Qwen2.5-72B Subset TPR | Qwen2.5-72B Subset TNR | Qwen2.5-72B Subset PPV | Qwen2.5-72B Subset NPV
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | o1 | Proprietary | Unknown | Unknown | GPT | | 89.5 | 90.6 | 88.4 | 88.7 | 90.4 | 88.4 | 88.9 | 87.9 | 88.9 | 87.9 | 90.6 | 95.3 | 84.6 | 91.1 | 91.7 | 93.2 | 90.5 | 95.8 | 90.5 | 95.8 | 83.8 | 86.8 | 80.6 | 84.6 | 83.3
2 | o1-mini | Proprietary | Unknown | Unknown | GPT | | 84.8 | 83.3 | 86.2 | 85.8 | 83.8 | 81.2 | 77.8 | 84.8 | 84.8 | 77.8 | 86.2 | 88.4 | 84.6 | 90.5 | 81.5 | 89.7 | 85.7 | 93.8 | 85.7 | 93.8 | 79.5 | 81.6 | 77.4 | 81.6 | 77.4
3 | Qwen/QwQ-32B-Preview | Open-Weights | 32.8 | Qwen2ForCausalLM | Qwen | https://huggingface.co/unsloth/QwQ-32B-Preview | 83.2 | 91.3 | 75.4 | 78.7 | 89.7 | 78 | 86.1 | 69.7 | 75.6 | 82.1 | 80 | 95.3 | 61.5 | 80.4 | 88.9 | 84 | 90.5 | 83.3 | 70.4 | 95.2 | 86.7 | 92.1 | 80.6 | 85.4 | 89.3
4 | deepseek-ai/DeepSeek-R1 | Open-Weights | 684.5 | DeepseekV3ForCausalLM | DeepSeek | https://huggingface.co/unsloth/DeepSeek-R1 | 82.2 | 76.8 | 87.7 | 86.2 | 79.1 | 79.7 | 72.2 | 87.9 | 86.7 | 74.4 | 82 | 81.4 | 84.6 | 89.7 | 73.3 | 88.2 | 85.7 | 91.7 | 81.8 | 93.6 | 76.8 | 71.1 | 83.9 | 84.4 | 70.3
5 | gemini-2.0-flash-thinking-exp-01-21 | Proprietary | Unknown | Unknown | Gemini | | 81 | 89.1 | 73.2 | 76.9 | 87.1 | 74.3 | 91.7 | 57.6 | 70.2 | 86.4 | 85.8 | 93 | 76.9 | 87 | 87 | 83.3 | 81 | 87.5 | 73.9 | 91.3 | 76 | 86.8 | 64.5 | 75 | 80
6 | google/gemini-1.5-pro | Proprietary | Unknown | Unknown | Gemini | | 80.7 | 77.5 | 84.5 | 85.2 | 76.4 | 78.2 | 76.4 | 80.2 | 80.5 | 76.1 | 79.5 | 81 | 82.9 | 91.6 | 65.4 | 83.6 | 75.3 | 90.8 | 82 | 86.8 | 77.7 | 75.5 | 81 | 84.2 | 71.2
7 | gpt-4o-2024-08-06 | Proprietary | Unknown | Unknown | GPT | | 77.4 | 70.1 | 85.9 | 85.1 | 71.3 | 77.5 | 72.1 | 83.2 | 82.1 | 73.6 | 72.6 | 70.4 | 82.9 | 90.5 | 54.8 | 81.8 | 71.1 | 90.8 | 81.2 | 84.9 | 74.2 | 67.1 | 83.6 | 84.6 | 65.5
8 | mistralai/Mistral-Large-Instruct-2411 | Open-Weights | 122.6 | MistralForCausalLM | Mistral | https://huggingface.co/unsloth/Mistral-Large-Instruct-2411 | 76.6 | 75.7 | 77.7 | 79.7 | 73.5 | 76 | 75.7 | 76.3 | 77.4 | 74.6 | 75 | 79.9 | 73.2 | 87.3 | 61.2 | 78.6 | 75.3 | 82.8 | 70.9 | 85.7 | 72.5 | 71 | 75 | 79.1 | 65.9
9 | Qwen/Qwen2.5-72B-Instruct | Open-Weights | 72.7 | Qwen2ForCausalLM | Qwen | https://huggingface.co/unsloth/Qwen2.5-72B-Instruct | 75.6 | 77.1 | 74.2 | 77.5 | 73.7 | 73.7 | 76.4 | 71 | 73.8 | 73.8 | 74.2 | 79.4 | 72 | 86.7 | 60.2 | 79.3 | 75.3 | 83.9 | 72.3 | 85.9 | 70.5 | 76.1 | 64.7 | 74.2 | 67
10 | google/gemini-1.5-flash | Proprietary | Unknown | Unknown | Gemini | | 74.8 | 63.3 | 88.3 | 86.2 | 67.6 | 70.1 | 57.9 | 84 | 79.4 | 65.1 | 73.9 | 67.7 | 91.5 | 94.8 | 55.1 | 80.6 | 67 | 92 | 82.3 | 83.3 | 71.2 | 60.6 | 85.3 | 84.7 | 61.9
11 | claude-sonnet-3-5 | Proprietary | Unknown | Unknown | Claude | | 74.8 | 62.5 | 89.5 | 87.3 | 67.4 | 72.2 | 57.9 | 88.5 | 84.4 | 66.3 | 73.8 | 70.9 | 85.4 | 91.8 | 56 | 77.9 | 59.8 | 93.1 | 82.9 | 80.6 | 70.8 | 58.1 | 87.9 | 86.5 | 61.1
12 | Qwen/Qwen2.5-Math-72B-Instruct | Open-Weights | 72.7 | Qwen2ForCausalLM | Qwen | https://huggingface.co/unsloth/Qwen2.5-Math-72B-Instruct | 74 | 80.9 | 66.8 | 73.8 | 75.2 | 68.2 | 77.9 | 58.8 | 66.9 | 71.3 | 76.8 | 82.5 | 73.2 | 87.6 | 64.5 | 77.3 | 81.4 | 76.4 | 65.8 | 88.1 | 69.3 | 81.3 | 56.9 | 71.6 | 69.5
13 | gpt-4o-mini-2024-07-18 | Proprietary | Unknown | Unknown | GPT | | 72.3 | 59 | 88.1 | 85.1 | 65.1 | 70.4 | 56.4 | 86.3 | 81.4 | 64.9 | 69.6 | 63 | 87.8 | 92.2 | 50.7 | 76.2 | 59.8 | 90.2 | 77.3 | 80.1 | 69.3 | 56.1 | 87.1 | 85.3 | 59.8
14 | Qwen/Qwen2.5-7B-Instruct | Open-Weights | 7.6 | Qwen2ForCausalLM | Qwen | https://huggingface.co/unsloth/Qwen2.5-7B-Instruct | 69.3 | 78.7 | 59.8 | 69.3 | 70.8 | 68.3 | 81.4 | 55.7 | 66.3 | 73.7 | 69.1 | 79.4 | 59.8 | 82 | 55.7 | 72.3 | 78.4 | 70.1 | 59.4 | 85.3 | 62.4 | 75.5 | 49.1 | 66.5 | 60
15 | Qwen/Qwen2.5-Math-7B-Instruct | Open-Weights | 7.6 | Qwen2ForCausalLM | Qwen | https://huggingface.co/unsloth/Qwen2.5-Math-7B-Instruct | 61.9 | 76.6 | 47.9 | 62.9 | 63.9 | 57.2 | 75 | 41.2 | 57.7 | 60.7 | 63.8 | 77.8 | 50 | 78.2 | 49.4 | 63.8 | 85.6 | 51.7 | 49.7 | 86.5 | 59.7 | 71 | 48.3 | 64.7 | 55.4
16 | meta-llama/Llama-3.1-70B-Instruct | Open-Weights | 70.6 | LlamaForCausalLM | LLaMA | https://huggingface.co/unsloth/Llama-3.1-70B-Instruct | 61 | 62.5 | 59.6 | 64.1 | 57.9 | 69.4 | 67.1 | 71.8 | 71.8 | 67.1 | 58.8 | 61.4 | 61 | 78.4 | 40.7 | 57 | 63.9 | 54 | 43.7 | 72.9 | 56 | 58.7 | 53.4 | 62.8 | 49.2
17 | mistralai/Ministral-8B-Instruct-2410 | Open-Weights | 8 | MistralForCausalLM | Mistral | https://huggingface.co/unsloth/Ministral-8B-Instruct-2410 | 60.5 | 55.9 | 65.8 | 65.4 | 56.4 | 62.9 | 53.6 | 73.3 | 68.2 | 59.6 | 58.3 | 63 | 57.3 | 77.3 | 40.2 | 63.1 | 59.8 | 67.8 | 50.9 | 75.2 | 52.8 | 47.1 | 60.3 | 61.3 | 46.1
18 | meta-llama/Llama-3.1-8B-Instruct | Open-Weights | 8 | LlamaForCausalLM | LLaMA | https://huggingface.co/unsloth/Llama-3.1-8B-Instruct | 52 | 48.7 | 55.9 | 56 | 48.5 | 51.2 | 46.4 | 56.5 | 53.3 | 49.7 | 55.5 | 55 | 62.2 | 77 | 37.5 | 49.2 | 45.4 | 54 | 35.5 | 63.9 | 48.7 | 45.2 | 53.4 | 56.5 | 42.2
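The judgment metrics above are the standard confusion-matrix rates, treating "solution accepted as correct" as the positive class. A minimal sketch of how they relate (the `Confusion` helper is hypothetical, and reading the leaderboard F1 as a macro-average over the accept and reject classes is an assumption):

```python
from dataclasses import dataclass

@dataclass
class Confusion:
    tp: int  # judge accepts a genuinely correct solution
    fn: int  # judge rejects a correct solution
    tn: int  # judge rejects a genuinely incorrect solution
    fp: int  # judge accepts an incorrect solution

def judge_metrics(c: Confusion) -> dict[str, float]:
    tpr = c.tp / (c.tp + c.fn)  # TPR (sensitivity): share of correct solutions accepted
    tnr = c.tn / (c.tn + c.fp)  # TNR (specificity): share of incorrect solutions rejected
    ppv = c.tp / (c.tp + c.fp)  # PPV (precision): how trustworthy an "accept" verdict is
    npv = c.tn / (c.tn + c.fn)  # NPV: how trustworthy a "reject" verdict is
    f1_pos = 2 * ppv * tpr / (ppv + tpr)  # F1 of the "accept" class
    f1_neg = 2 * npv * tnr / (npv + tnr)  # F1 of the "reject" class
    # Assumption: the leaderboard F1 is the macro-average of the two class F1s.
    return {"F1": (f1_pos + f1_neg) / 2, "TPR": tpr, "TNR": tnr, "PPV": ppv, "NPV": npv}
```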
Combined U-MATH / μ-MATH results
Models evaluated on both benchmarks, with their rank on each: solving ability (U-MATH Acc) does not always track judging ability (μ-MATH F1).

U-MATH Rank | μ-MATH Rank | Type | Model Name | U-MATH Acc | μ-MATH F1
---|---|---|---|---|---
1 | 5 | Proprietary | gemini-2.0-flash-thinking-exp-01-21 | 73.6 | 81
2 | 1 | Proprietary | o1 | 70.7 | 89.5
4 | 4 | Open-Weights | deepseek-ai/DeepSeek-R1 | 68.4 | 82.2
5 | 2 | Proprietary | o1-mini | 63.5 | 84.8
6 | 3 | Open-Weights | Qwen/QwQ-32B-Preview | 61.5 | 83.2
7 | 6 | Proprietary | google/gemini-1.5-pro | 60.1 | 80.7
10 | 10 | Proprietary | google/gemini-1.5-flash | 51.3 | 74.8
11 | 12 | Open-Weights | Qwen/Qwen2.5-Math-72B-Instruct | 50.2 | 74
15 | 7 | Proprietary | gpt-4o-2024-08-06 | 43.5 | 77.4
17 | 9 | Open-Weights | Qwen/Qwen2.5-72B-Instruct | 41 | 75.6
18 | 8 | Open-Weights | mistralai/Mistral-Large-Instruct-2411 | 40.4 | 76.6
20 | 15 | Open-Weights | Qwen/Qwen2.5-Math-7B-Instruct | 38.4 | 61.9
22 | 13 | Proprietary | gpt-4o-mini-2024-07-18 | 37.2 | 72.3
23 | 11 | Proprietary | claude-sonnet-3-5 | 35.1 | 74.8
24 | 14 | Open-Weights | Qwen/Qwen2.5-7B-Instruct | 33.8 | 69.3
29 | 16 | Open-Weights | meta-llama/Llama-3.1-70B-Instruct | 28.5 | 61
32 | 18 | Open-Weights | meta-llama/Llama-3.1-8B-Instruct | 22.3 | 52
36 | 17 | Open-Weights | mistralai/Ministral-8B-Instruct-2410 | 18.3 | 60.5
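One way to quantify how strongly solving skill tracks judging skill is a rank correlation over the paired columns. A small sketch using SciPy, with the values hand-copied row by row from the combined table above:

```python
from scipy.stats import spearmanr

# (U-MATH Acc, μ-MATH F1) pairs, copied row by row from the combined table above.
u_math_acc = [73.6, 70.7, 68.4, 63.5, 61.5, 60.1, 51.3, 50.2, 43.5,
              41.0, 40.4, 38.4, 37.2, 35.1, 33.8, 28.5, 22.3, 18.3]
mu_math_f1 = [81.0, 89.5, 82.2, 84.8, 83.2, 80.7, 74.8, 74.0, 77.4,
              75.6, 76.6, 61.9, 72.3, 74.8, 69.3, 61.0, 52.0, 60.5]

rho, p_value = spearmanr(u_math_acc, mu_math_f1)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```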
This repository contains the official leaderboard code for the U-MATH and μ-MATH benchmarks.
Overview
- U-MATH benchmark on Hugging Face
- μ-MATH benchmark on Hugging Face
- Paper
- Evaluation code on GitHub
Licensing Information
- The contents of μ-MATH's machine-generated `model_output` column are subject to the underlying LLMs' licensing terms.
- Contents of all other U-MATH and μ-MATH dataset fields, as well as the code, are available under the MIT license.