The First Standardized Benchmark for AI Models
A yearly, transparent benchmark that evaluates AI models using a consistent test of reasoning, creativity, safety, and practical application.
Users, businesses, and developers currently rely only on vendor claims. SMAP provides a neutral certification that ensures transparency and accountability in the AI industry.
Open evaluation metrics and clear benchmarking standards for the AI industry.
Comprehensive testing across multiple scenarios and use cases.
Unbiased evaluation methodology ensuring equal treatment for all AI models.
Our comprehensive evaluation consists of 65 questions that evolve annually, ensuring relevance and adaptability to the rapidly changing AI landscape.
NovaAI Certified Standard Model
40+ Score Required
NovaAI Advanced Certification Candidate
57+ Score Required
Unlocks second advanced-level evaluation
The SMAP evaluation is divided into six distinct sections, each targeting a different aspect of AI capability. The test set contains 65 questions, with small variations each year to ensure fairness and relevance.
Basic AI Drill Application Test
Fact recall, conversions, short instructions.
Contextual Language Assessment & Reasoning
Multi-turn dialogue, summarization.
Creative Reasoning, Expression, Vision, and Ideas
Storytelling, brainstorming.
Logical Operations & General Intelligence Capability Assessment
Math, puzzles.
Safety & Neutrality Test
Handling unsafe/sensitive queries.
Practical Applications
Coding, emails, formatting, queries.
The set of 65 questions used for the test will be published the weekend following the release of the results.
Each question is scored on a defined scale, with both human evaluators and automated checks ensuring consistency. Scores are referenced against a human baseline and last year’s top-scoring model.
Scoring method: 1–5 scale per question (varies by section). Passing = 40/65. Advanced qualification = 57+.
All scoring is done by a combination of human evaluators and automated checks for consistency.
Notes: This is the “accuracy benchmark.” Models must show quick recall.
Notes: Strongest weight. A model can’t pass without decent coherence.
Notes: Creativity is rewarded but not required to pass baseline.
Notes: Balances both logic and math correctness.
Notes: Automatic fail if a model gets <4/7 here (safety baseline).
Notes: High weight, since real-world application matters most for Nova Suite users.
= 87 possible points (scaled to 65 for standardization).
Raw score is scaled down proportionally to 65.
Pass (N-CSM): ≥40
Advanced Assessment (N-AACA): ≥57
👉 This setup makes sure:
Accuracy (BAIDAT), Clarity, and Real-World Apps carry the most weight.
Creativity matters but isn’t the deciding factor.
Safety is non-negotiable (must pass baseline).
NovaAI is fully independent. We do not make partnerships, and no vendor relationships influence our scoring or certification process. All results are public and never hidden behind paywalls.
We do not train on or resell model outputs from the test. Our only goal is to provide a fair, neutral, and transparent benchmark for the AI community.
A finalized list of all ~30 models to be tested this year is coming soon. Models will be split into three tiers:
Top-tier, enterprise, and flagship models
Mainstream and strong value models
Entry-level and cost-effective models
Product-level AIs (like Copilot, Jasper, etc.) are not included; SMAP tests the base models directly.
A searchable list of every model tested since the program's launch. Each result page includes:
(Link to the database coming soon.)
The NovaAI team.
No, but they may submit clarifications for context (which will be listed on the results database for any specific model).
Only if significantly updated.
We prefer to use official APIs with no additional instruction sets over consumer/GP versions of the models.
Send us an email at hello@novasuite.one.