Evaluation

HOT

Projects and tools for evaluating AI and ML models.

Active projects: 74
New this week: +74
Star growth this period: +284
Cross-source repos: 4
Total stars: 266.5k
Total forks: 32.5k
Multi-source repos: 5

Top Projects (74)

ML

mlflow/mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

Trend 21
agentops agents ai ai-governance apache-spark evaluation langchain llm-evaluation llmops machine-learning ml mlflow mlops model-management observability open-source openai prompt-engineering
25.2k 5.5k +23/wk
GitHub PyPI 2-source
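
For the MLflow entry above, a minimal tracking sketch in Python. The experiment name, parameters, and metric values are illustrative; the call pattern uses the standard `mlflow.start_run` tracking API and logs to the local `./mlruns` store.

```python
import mlflow

# Log one training run locally (no tracking server required).
mlflow.set_experiment("eval-demo")  # experiment name is illustrative

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)  # hyperparameters
    mlflow.log_metric("accuracy", 0.91)      # evaluation results
    mlflow.log_metric("f1", 0.88)
```
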
LA

langfuse/langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

Trend 20
analytics autogen evaluation langchain large-language-models llama-index llm llm-evaluation llm-observability llmops monitoring observability open-source openai playground prompt-engineering prompt-management self-hosted ycombinator
24.6k 2.5k +76/wk
GitHub HuggingFace PyPI 3-source
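
A minimal tracing sketch for the Langfuse entry above, assuming the decorator-style Python SDK (the `observe` import path varies between SDK versions) and that `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` are set in the environment.

```python
from langfuse.decorators import observe  # v2-style import; newer SDKs expose `observe` from `langfuse`

@observe()  # records this call as a trace in Langfuse
def answer(question: str) -> str:
    # ... call your LLM of choice here ...
    return "42"

answer("What is the answer to everything?")
```
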
OP

comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Trend 18
evaluation hacktoberfest hacktoberfest2025 langchain llama-index llm llm-evaluation llm-observability llmops open-source openai playground prompt-engineering
18.7k 1.4k +20/wk
GitHub PyPI 2-source
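
A tracing sketch for the Opik entry above; it assumes the SDK's `track` decorator and that an API key or self-hosted instance is already configured, so treat the exact names as assumptions rather than a definitive integration.

```python
from opik import track  # assumed import; check the SDK docs for your version

@track  # logs inputs, outputs, and latency for this function as a trace
def summarize(text: str) -> str:
    # ... call your LLM of choice here ...
    return text[:100]

summarize("Opik traces LLM calls so they can be evaluated later.")
```
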
OB

BlazeUp-AI/Observal

Observal is an observability platform and local registry for MCPs, hooks, skills, graphRAGs and more!

Trend 4
cli-tool evaluation large-language-models llm llm-evaluation llm-observability llmops monitoring observability open-source playground self-hosted
148 14 +4/wk
GitHub
RE

InternScience/ResearchClawBench

ResearchClawBench: Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery

Trend 4
agent ai ai-agent ai-scientist ai4science auto-research benchmark claude claude-code clawdbot codex discovery end-to-end evaluation llm openai openclaw research-claw science
66 5 +3/wk
GitHub
AR

arklexai/arksim

Find your agent's errors before your real users do

Trend 3
agents ai chatbot conversational-ai evaluation llm open-source python simulation testing
141 11 +0/wk
GitHub
ME

memvid/memvid

Memory layer for AI Agents. Replace complex RAG pipelines with a serverless, single-file memory layer. Give your agents instant retrieval and long-term memory.

Trend 3
ai context embedded faiss knowledge-base knowledge-graph llm machine-learning memory memvid mv2 nlp offline-first opencv python rag retrieval-augmented-generation semantic-search vector-database video-processing
14.7k 1.3k +56/wk
GitHub
OU

Q00/ouroboros

Stop prompting. Start specifying.

Trend 3
ai-agent claude-code codex-cli devtools evaluation llm mcp multi-agent prompt-engineering python spec-driven-development workflow-automation
2.1k 199 +12/wk
GitHub
KE

ScalingIntelligence/KernelBench

KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)

Trend 3
benchmark codegen evaluation gpu rl-environment tooling
916 150 +4/wk
GitHub
PR

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

Trend 3
ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners
19.8k 1.7k +45/wk
GitHub PyPI 2-source
RA

vibrantlabsai/ragas

Supercharge Your LLM Application Evaluations 🚀

Trend 3
evaluation llm llmops
13.3k 1.3k +17/wk
GitHub
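
For the Ragas entry above, a minimal sketch of metric-based evaluation. The column names and metric imports follow the 0.1.x API and may differ in newer releases, the sample rows are illustrative, and an LLM judge (e.g. an OpenAI key) must be configured for these metrics to run.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One QA example with its retrieved contexts (illustrative data).
ds = Dataset.from_dict({
    "question": ["Who created MLflow?"],
    "answer": ["MLflow was open-sourced by Databricks."],
    "contexts": [["MLflow is an open source platform started at Databricks."]],
})

result = evaluate(ds, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores
```
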
LM

lmnr-ai/lmnr

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

Trend 3
agent-observability agents ai ai-observability aiops analytics developer-tools evals evaluation llm-evaluation llm-observability llmops monitoring observability open-source rust rust-lang self-hosted ts typescript
2.8k 191 +3/wk
GitHub
EV

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

Trend 3
evaluation llm performance rag vlm
2.6k 301 +4/wk
GitHub
GR

1517005260/graph-rag-agent

Integrates GraphRAG, LightRAG, and Neo4j-llm-graph-builder for knowledge graph construction and search; combines DeepSearch for private-domain RAG reasoning; and includes a custom evaluation framework for GraphRAG.

Trend 3
agentic-rag chain-of-exploration deepresearch deepsearch evaluation graphrag graphsearch kg lightrag reasoning think-on-graph
2.1k 283 +4/wk
GitHub
AL

Xnhyacinth/Awesome-LLM-Long-Context-Modeling

📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥

Trend 3
agent awsome-list benchmark blogs compress evaluation large-language-models length-extrapolation llm long-context-modeling long-term-memory longcot papers rag ssm survey transformer
2.0k 83 +1/wk
GitHub
LE

MLGroupJLU/LLM-eval-survey

The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".

Trend 3
benchmark evaluation large-language-models llm llms model-assessment
1.6k 99 +1/wk
GitHub
MM

MMMU-Benchmark/MMMU

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"

Trend 3
computer-vision deep-learning deep-neural-networks evaluation foundation-models large-language-models large-multimodal-models llm llms machine-learning multimodal multimodal-deep-learning multimodal-learning multimodality natural-language-processing question-answering stem visual-question-answering
553 50 +1/wk
GitHub
OP

agentscope-ai/OpenJudge

OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards

Trend 3
agent agent-skills ai-agent alignment evaluation grader llm reward reward-model rlhf skill-md skills
531 45 +3/wk
GitHub
MA

facebookresearch/meta-agents-research-environments

Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks, this platform introduces evolving environments where agents must adapt their strategies as new information becomes available, mirroring real-world challenges.

Trend 3
agents ai autonomous-agents benchmark evaluation large-language-models llm meta multi-agent-systems natural-language-processing reinforcement-learning rl simulation
469 63 +1/wk
GitHub
UA

OpenBMB/UltraEval-Audio

Your faithful, impartial partner for audio evaluation: know yourself, know your rivals.

Trend 3
evaluation speech-recognition speech-to-speech speech-to-text
286 21 +0/wk
GitHub
EV

strands-agents/evals

A comprehensive evaluation framework for AI agents and LLM applications.

Trend 3
agentic agentic-ai ai evaluation machine-learning python strands-agents
102 27 +0/wk
GitHub
LA

gil-son/language-ai-engineering-lab

Language AI Engineering Lab, a place where you can deeply understand and build modern Language AI systems, from fundamentals to production.

Trend 3
context-engineering embeedings evaluation generative-ai langchain llm nlg nlp nlu ollama openai prompt-engineering rag tokenization transformers
94 19 +1/wk
GitHub
WE

Tencent/WeKnora

LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using RAG paradigm.

Trend 3
agent agentic ai chatbot chatbots embeddings evaluation generative-ai golang knowledge-base llm multi-tenant multimodel ollama openai question-answering rag reranking semantic-search vector-search
13.8k 1.6k +12/wk
GitHub
TE

tensorzero/tensorzero

TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.

Trend 3
ai ai-engineering anthropic artificial-intelligence deep-learning genai generative-ai gpt large-language-models llama llm llmops llms machine-learning ml ml-engineering mlops openai python rust
11.2k 806 -3/wk
GitHub
OU

oumi-ai/oumi

Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!

Trend 3
dpo evaluation fine-tuning gpt-oss gpt-oss-120b gpt-oss-20b inference llama llms sft slms vlms
9.2k 744 +2/wk
GitHub
OP

open-compass/opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.

Trend 3
benchmark chatgpt evaluation large-language-model llama2 llama3 llm openai
6.8k 755 +2/wk
GitHub
HE

Helicone/helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

Trend 3
agent-monitoring analytics evaluation gpt langchain large-language-models llama-index llm llm-cost llm-evaluation llm-observability llmops monitoring open-source openai playground prompt-engineering prompt-management ycombinator
5.5k 507 +1/wk
GitHub
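
A sketch of the "one line of code" pattern for the Helicone entry above: route OpenAI traffic through the Helicone gateway by overriding the base URL and adding an auth header. The proxy URL and header name are assumptions based on Helicone's documented gateway setup.

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at the Helicone gateway (assumed URL and header).
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)  # request and response are now logged in Helicone
```
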
CL

coze-dev/coze-loop

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

Trend 3
agent agent-evaluation agent-observability agentops ai coze eino evaluation langchain llm-observability llmops monitoring observability open-source openai playground prompt-management
5.4k 745 +4/wk
GitHub
KI

Kiln-AI/Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

Trend 3
ai chain-of-thought collaboration dataset-generation evals evaluation evaluation-framework fine-tuning machine-learning macos mcp ml ollama openai prompt prompt-engineering python rlhf synthetic-data windows
4.7k 352 -1/wk
GitHub
AU

Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

Trend 3
analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation llm-ops open-source ops optimization pipeline python qa rag rag-evaluation retrieval-augmented-generation
4.7k 389 -1/wk
GitHub
AG

Agenta-AI/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

Trend 3
agents evaluation llm-as-a-judge llm-evaluation llm-framework llm-monitoring llm-observability llm-platform llm-playground llm-tools llmops observability prompt-engineering prompt-management rag-evaluation
4.0k 506 +0/wk
GitHub
VL

open-compass/VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

Trend 3
chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa
4.0k 672 +2/wk
GitHub
LE

EvolvingLMMs-Lab/lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Trend 3
agi audio-evaluation benchmark evaluation large-language-models llm-evaluation multimodal multimodal-evaluation video-understanding vision-language-model vlm
4.0k 557 +0/wk
GitHub
LA

langwatch/langwatch

The platform for LLM evaluations and AI agent testing

Trend 3
ai analytics datasets dspy evaluation gpt llm llm-ops llmops low-code observability openai prompt-engineering
3.2k 307 -1/wk
GitHub
PR

microsoftarchive/promptbench

A unified evaluation framework for large language models

Trend 3
adversarial-attacks benchmark chatgpt evaluation large-language-models prompt prompt-engineering robustness
2.8k 219 -1/wk
GitHub
EV

huggingface/evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Trend 3
evaluation machine-learning
2.4k 313 +0/wk
GitHub
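
A minimal sketch for the 🤗 Evaluate entry above: load a built-in metric and score toy predictions.

```python
import evaluate

accuracy = evaluate.load("accuracy")  # downloads the metric script on first use
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}
```
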
LI

huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Trend 3
evaluation evaluation-framework evaluation-metrics huggingface
2.4k 449 +0/wk
GitHub
UP

uptrain-ai/uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

Trend 3
autoevaluation evaluation experimentation hallucination-detection jailbreak-detection llm-eval llm-prompting llm-test llmops machine-learning monitoring openai-evals prompt-engineering root-cause-analysis
2.3k 204 +0/wk
GitHub
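
A sketch of running preconfigured checks with the UpTrain entry above; `EvalLLM` and the `Evals` enum follow the project's documented open-source API, but treat the exact names and data keys as assumptions.

```python
from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key="sk-...")  # an LLM judge is required

# One response with its question and retrieved context (illustrative data).
data = [{
    "question": "What does UpTrain do?",
    "context": "UpTrain scores LLM responses with preconfigured checks.",
    "response": "UpTrain grades LLM outputs against built-in checks.",
}]

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_RELEVANCE],
)
print(results)  # per-check grades and explanations
```
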
EG

huggingface/evaluation-guidebook

Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!

Trend 3
evaluation evaluation-metrics guidebook large-language-models llm machine-learning tutorial
2.1k 122 +1/wk
GitHub
AB

xinshuoweng/AB3DMOT

(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"

Trend 3
2d-mot-evaluation 3d-mot 3d-multi 3d-multi-object-tracking 3d-tracking computer-vision evaluation evaluation-metrics kitti kitti-3d machine-learning multi-object-tracking real-time robotics tracking
1.8k 416 +1/wk
GitHub
WF

onestardao/WFGY

WFGY is an open-source AI Troubleshooting Atlas for RAG, agents, and real-world AI workflows. Includes the 16-problem map, Global Debug Card, and WFGY 4.0. ⭐ Star to help more builders find this repo.

Trend 3
ai-agents alignment debugging evaluation graphrag hallucination information-retrieval knowledge-graph llm rag reasoning retrieval-augmented-generation
1.7k 161 +0/wk
GitHub
KU

run-house/kubetorch

Distribute and run AI workloads on Kubernetes magically in Python, like PyTorch for ML infra.

Trend 3
artificial-intelligence aws data-processing data-science distributed evaluation gcp inference infrastructure kubernetes machine-learning observability python pytorch ray serverless training
1.2k 53 +0/wk
GitHub
TF

toshas/torch-fidelity

High-fidelity performance metrics for generative models in PyTorch

Trend 3
evaluation frechet-inception-distance gan generative-model inception-score kernel-inception-distance metrics perceptual-path-length precision pytorch reproducibility reproducible-research
1.2k 87 +1/wk
GitHub
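
For the torch-fidelity entry above, a sketch of computing FID, IS, and KID between two image folders; the directory paths are placeholders.

```python
import torch_fidelity

# Compare a folder of generated samples against a folder of real images.
metrics = torch_fidelity.calculate_metrics(
    input1="path/to/generated_images",
    input2="path/to/real_images",
    isc=True,   # Inception Score
    fid=True,   # Frechet Inception Distance
    kid=True,   # Kernel Inception Distance
)
print(metrics)
```
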
DE

always-further/deepfabric

Generate High-Quality Synthetics, Train, Measure, and Evaluate in a Single Pipeline

Trend 3
agents ai data-science dataset distillation evaluation fine-tuning huggingface huggingface-datasets machine-learning open open-source python source synthetic synthetic-data unsloth
851 80 +0/wk
GitHub
WC

angular/web-codegen-scorer

Web Codegen Scorer is a tool for evaluating the quality of web code generated by LLMs.

Trend 3
benchmarking codegen evaluation llm-coding
711 60 +0/wk
GitHub
LI

ModelTC/LightCompress

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.

Trend 3
awq benchmark deepseek-v3 deployment evaluation internlm2 large-language-models llm mixtral pruning quantization smoothquant token-merging token-pruning token-reduction tool vllm wan
698 77 +0/wk
GitHub
AL

onejune2018/Awesome-LLM-Eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of foundation LLMs, aimed at exploring the technical boundaries of generative AI.

Trend 3
awsome-list awsome-lists benchmark bert chatglm chatgpt dataset evaluation gpt3 large-language-model leaderboard llama llm llm-evaluation machine-learning nlp openai qwen rag
630 54 +1/wk
GitHub
TR

HowieHwong/TrustLLM

[ICML 2024] TrustLLM: Trustworthiness in Large Language Models

Trend 3
ai benchmark dataset evaluation large-language-models llm natural-language-processing nlp pypi-package toolkit trustworthy-ai trustworthy-machine-learning
623 67 +0/wk
GitHub
TD

jkkummerfeld/text2sql-data

A collection of datasets that pair questions with SQL queries.

Trend 3
database dataset dynet evaluation natural-language-interface natural-language-processing neural-network nlp sql
587 116 +0/wk
GitHub
AF

SAILResearch/awesome-foundation-model-leaderboards

A curated list of awesome leaderboard-oriented resources for AI domain

Trend 3
ai-agent artificial-intelligence awesome-list benchmark deep-learning evaluation foundation-model large-ai-model leaderboard machine-learning ranking-system
337 40 -1/wk
GitHub
FM

zalandoresearch/fashion-mnist

An MNIST-like fashion product database and benchmark.

Trend 0
benchmark computer-vision convolutional-neural-networks dataset deep-learning fashion fashion-mnist gan machine-learning mnist zalando
12.7k 3.1k -2/wk
GitHub
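
For the Fashion-MNIST entry above, a sketch of loading the dataset via the Keras mirror (one common route; the repo also ships its own loader).

```python
from tensorflow import keras

# 60k training and 10k test grayscale 28x28 images across 10 clothing classes.
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
```
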
CH

ianarawjo/ChainForge

An open-source visual programming environment for battle-testing prompts to LLMs.

Trend 0
ai evaluation large-language-models llmops llms prompt-engineering
3.0k 254 -2/wk
GitHub
AS

zzw922cn/Automatic_Speech_Recognition

End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow

Trend 0
audio automatic-speech-recognition chinese-speech-recognition cnn data-preprocessing deep-learning end-to-end evaluation feature-vector layer-normalization lstm paper phonemes rnn rnn-encoder-decoder speech-recognition tensorflow timit-dataset
2.8k 534 +0/wk
GitHub
EV

Cloud-CV/EvalAI

Evaluating state of the art in AI

Trend 0
ai ai-challenges angularjs artificial-intelligence challenge codecov coveralls django docker evalai evaluation leaderboard machine-learning python reproducibility reproducible-research travis-ci
2.0k 987 +0/wk
GitHub
AE

tatsu-lab/alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

Trend 0
deep-learning evaluation foundation-models instruction-following large-language-models leaderboard nlp rlhf
2.0k 306 +0/wk
GitHub
EM

SmartFlowAI/EmoLLM

Mental health LLM (LLM x Mental Health): pre- and post-training, datasets, evaluation, deployment, and RAG, with InternLM / Qwen / Baichuan / DeepSeek / Mixtral / LLaMA / GLM series models

Trend 0
dataset depoly evaluation llm post-training the-big-model-of-mental-health
1.7k 218 -2/wk
GitHub
RA

deepsense-ai/ragbits

Building blocks for rapid development of GenAI applications

Trend 0
agents document-search evaluation guardrails llms optimization prompts rag vector-stores
1.6k 135 -1/wk
GitHub
PY

sepandhaghighi/pycm

Multi-class confusion matrix library in Python

Trend 0
accuracy ai artificial-intelligence classification confusion-matrix data data-analysis data-mining data-science deep-learning deeplearning evaluation machine-learning mathematics matrix ml multiclass-classification neural-network statistical-analysis statistics
1.5k 123 +0/wk
GitHub
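
A minimal sketch for the PyCM entry above: build a multi-class confusion matrix from label vectors and read off overall and per-class statistics (the label vectors are toy data).

```python
from pycm import ConfusionMatrix

# Toy 3-class predictions versus ground truth.
cm = ConfusionMatrix(
    actual_vector=[0, 1, 2, 2, 1, 0],
    predict_vector=[0, 2, 2, 2, 1, 0],
)
print(cm.Overall_ACC)  # overall accuracy
print(cm.F1)           # per-class F1 scores as a dict
```
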
NE

Maluuba/nlg-eval

Evaluation code for various unsupervised automated metrics for Natural Language Generation.

Trend 0
bleu bleu-score cider dialog dialogue evaluation machine-translation meteor natural-language-generation natural-language-processing nlg nlp rouge rouge-l skip-thought-vectors skip-thoughts task-oriented-dialogue
1.4k 226 +0/wk
GitHub
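
A sketch for the nlg-eval entry above, scoring one hypothesis against a reference with the word-overlap metrics only; the constructor flags that disable the embedding-based metrics are assumptions based on the project's documented options.

```python
from nlgeval import NLGEval

# Skip the heavy embedding-based metrics; keep BLEU, METEOR, ROUGE-L, CIDEr.
nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)

scores = nlgeval.compute_individual_metrics(
    ref=["the cat sat on the mat"],
    hyp="a cat is sitting on the mat",
)
print(scores)
```
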
XA

EthicalML/xai

XAI - An eXplainability toolbox for machine learning

Trend 0
ai artificial-intelligence bias bias-evaluation downsampling evaluation explainability explainable-ai explainable-ml feature-importance imbalance interpretability machine-learning machine-learning-explainability ml upsampling xai xai-library
1.2k 185 +0/wk
GitHub
SK

PRBonn/semantic-kitti-api

SemanticKITTI API for visualizing dataset, processing data, and evaluating results.

Trend 0
dataset deep-learning evaluation labels large-scale-dataset machine-learning semantic-scene-completion semantic-segmentation
894 195 +0/wk
GitHub
RF

IntelLabs/RAG-FiT

Framework for enhancing LLMs for RAG tasks using fine-tuning.

Trend 0
evaluation fine-tuning information-retrieval llm nlp question-answering rag semantic-search
769 62 +0/wk
GitHub
LF

google-deepmind/long-form-factuality

Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".

Trend 0
benchmark dataset evaluation factuality language language-modeling large-language-models metrics
678 82 +0/wk
GitHub
AU

ucinlp/autoprompt

AutoPrompt: Automatic Prompt Construction for Masked Language Models.

Trend 0
evaluation language-model nlp
640 87 +0/wk
GitHub
ST

SeekingDream/Static-to-Dynamic-LLMEval

The official GitHub repository of the paper "Recent advances in large language model benchmarks against data contamination: From static to dynamic evaluation"

Trend 0
benchmark dynamic-evaluation evaluation large-language-model llm llms testing
516 40 -7/wk
GitHub
RE

sb-ai-lab/RePlay

A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models

Trend 0
algorithms collaborative-filtering deep-learning distributed-computing evaluation machine-learning matrix-factorization pyspark pytorch recommendation-algorithms recommender-system recsys transformers
394 38 +0/wk
GitHub
RE

microsoft/rag-experiment-accelerator

The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and RAG pattern.

Trend 0
acs azure chunking dense embedding evaluation experiment genai indexing information-retrieval llm openai rag sparse vectors
299 107 +0/wk
GitHub
AR

Ayanami0730/arag

A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.

Trend 0
agent agentic-ai agenticrag deepresearch evaluation graphrag llm llmagents rag reproduce
240 30 +0/wk
GitHub
NC

lechmazur/nyt-connections

Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words

Trend 0
benchmark claude evaluation gemini-pro gpt-5 grok4 llm llms-benchmarking puzzles reasoning testing
200 8 +0/wk
GitHub
EV

ai-twinkle/Eval

High-performance LLM evaluation framework with parallel API calls — up to 17× faster than sequential tools. Supports box, math, and logit-based evaluation.

Trend 0
eval evaluation llm
93 16 +0/wk
GitHub
EV

hidai25/eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

Trend 0
agent-benchmark agent-evaluation agentic-ai ai-agents anthropic autogen cli crewai evaluation langchain-agent langgraph llm mcp openai-assistants pytest python regression-testing testing
80 17 +0/wk
GitHub
SC

InternScience/SciEvalKit

A unified evaluation toolkit and leaderboard for rigorously assessing the scientific intelligence of large language and vision–language models across the full research workflow.

Trend 0
agent ai ai4science code-generation evaluation evaluation-framework gemini gpt llm llm-evaluation vllm
79 10 +0/wk
GitHub
AE

arthur-ai/arthur-engine

Make AI work for Everyone: monitoring and governance for your AI/ML

Trend 0
agentic benchmarking evaluation genai guardrails llm ml monitoring tracing
74 10 +0/wk
GitHub
EV

dustalov/evalica

Evalica, your favourite evaluation toolkit

Trend 0
New Signal
arena bradley-terry elo evalica evals evaluation hacktoberfest leaderboard library llm pagerank pairwise-comparison pyo3 python ranking rating rust serbia statistics winrate
62 5 +0/wk
GitHub

Source Breakdown

GitHub: 266.5k stars · 32.5k forks · 74 repos
PyPI: 4 packages
HuggingFace: 1 linked repo
