Evaluation

HOT

Projects and tools for evaluating AI and ML models.

Active projects: 74
New this week: +74
Star growth this period: +284
Cross-source repos: 4
Total stars: 266.5k
Total forks: 32.5k
Multi-source repos: 5

Top Projects (74)

ML

mlflow/mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

Trend 21
agentops agents ai ai-governance apache-spark evaluation langchain llm-evaluation llmops machine-learning ml mlflow mlops model-management observability open-source openai prompt-engineering
25.2k 5.5k +23/wk
GitHub PyPI 2-source
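
For the MLflow entry above, a minimal tracking sketch in Python. The experiment name, parameters, and metric values are illustrative; the call pattern uses the standard `mlflow.start_run` tracking API and logs to the local `./mlruns` store.

```python
import mlflow

# Log one training run locally (no tracking server required).
mlflow.set_experiment("eval-demo")  # experiment name is illustrative

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)  # hyperparameters
    mlflow.log_metric("accuracy", 0.91)      # evaluation results
    mlflow.log_metric("f1", 0.88)
```
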
LA

langfuse/langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

Trend 20
analytics autogen evaluation langchain large-language-models llama-index llm llm-evaluation llm-observability llmops monitoring observability open-source openai playground prompt-engineering prompt-management self-hosted ycombinator
24.6k 2.5k +76/wk
GitHub HuggingFace PyPI 3-source
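
A minimal tracing sketch for the Langfuse entry above, assuming the decorator-style Python SDK (the `observe` import path varies between SDK versions) and that `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` are set in the environment.

```python
from langfuse.decorators import observe  # v2-style import; newer SDKs expose `observe` from `langfuse`

@observe()  # records this call as a trace in Langfuse
def answer(question: str) -> str:
    # ... call your LLM of choice here ...
    return "42"

answer("What is the answer to everything?")
```
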
OP

comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Trend 18
evaluation hacktoberfest hacktoberfest2025 langchain llama-index llm llm-evaluation llm-observability llmops open-source openai playground prompt-engineering
18.7k 1.4k +20/wk
GitHub PyPI 2-source
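
A tracing sketch for the Opik entry above; it assumes the SDK's `track` decorator and that an API key or self-hosted instance is already configured, so treat the exact names as assumptions rather than a definitive integration.

```python
from opik import track  # assumed import; check the SDK docs for your version

@track  # logs inputs, outputs, and latency for this function as a trace
def summarize(text: str) -> str:
    # ... call your LLM of choice here ...
    return text[:100]

summarize("Opik traces LLM calls so they can be evaluated later.")
```
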
OB

BlazeUp-AI/Observal

Observal is an observability platform and local registry for MCPs, hooks, skills, graphRAGs and more!

Trend 4
cli-tool evaluation large-language-models llm llm-evaluation llm-observability llmops monitoring observability open-source playground self-hosted
148 14 +4/wk
GitHub
RE

InternScience/ResearchClawBench

ResearchClawBench: Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery

Trend 4
agent ai ai-agent ai-scientist ai4science auto-research benchmark claude claude-code clawdbot codex discovery end-to-end evaluation llm openai openclaw research-claw science
66 5 +3/wk
GitHub
AR

arklexai/arksim

Find your agent's errors before your real users do

Trend 3
agents ai chatbot conversational-ai evaluation llm open-source python simulation testing
141 11 +0/wk
GitHub
ME

memvid/memvid

Memory layer for AI Agents. Replace complex RAG pipelines with a serverless, single-file memory layer. Give your agents instant retrieval and long-term memory.

Trend 3
ai context embedded faiss knowledge-base knowledge-graph llm machine-learning memory memvid mv2 nlp offline-first opencv python rag retrieval-augmented-generation semantic-search vector-database video-processing
14.7k 1.3k +56/wk
GitHub
OU

Q00/ouroboros

Stop prompting. Start specifying.

Trend 3
ai-agent claude-code codex-cli devtools evaluation llm mcp multi-agent prompt-engineering python spec-driven-development workflow-automation
2.1k 199 +12/wk
GitHub
KE

ScalingIntelligence/KernelBench

KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)

Trend 3
benchmark codegen evaluation gpu rl-environment tooling
916 150 +4/wk
GitHub
PR

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

Trend 3
ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners
19.8k 1.7k +45/wk
GitHub PyPI 2-source
RA

vibrantlabsai/ragas

Supercharge Your LLM Application Evaluations 🚀

Trend 3
evaluation llm llmops
13.3k 1.3k +17/wk
GitHub
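
For the Ragas entry above, a minimal sketch of metric-based evaluation. The column names and metric imports follow the 0.1.x API and may differ in newer releases, the sample rows are illustrative, and an LLM judge (e.g. an OpenAI key) must be configured for these metrics to run.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One QA example with its retrieved contexts (illustrative data).
ds = Dataset.from_dict({
    "question": ["Who created MLflow?"],
    "answer": ["MLflow was open-sourced by Databricks."],
    "contexts": [["MLflow is an open source platform started at Databricks."]],
})

result = evaluate(ds, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores
```
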
LM

lmnr-ai/lmnr

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

Trend 3
agent-observability agents ai ai-observability aiops analytics developer-tools evals evaluation llm-evaluation llm-observability llmops monitoring observability open-source rust rust-lang self-hosted ts typescript
2.8k 191 +3/wk
GitHub
EV

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

Trend 3
evaluation llm performance rag vlm
2.6k 301 +4/wk
GitHub
GR

1517005260/graph-rag-agent

Integrates GraphRAG, LightRAG, and Neo4j-llm-graph-builder for knowledge graph construction and search; combines DeepSearch for private-domain RAG reasoning; and includes a custom evaluation framework for GraphRAG.

Trend 3
agentic-rag chain-of-exploration deepresearch deepsearch evaluation graphrag graphsearch kg lightrag reasoning think-on-graph
2.1k 283 +4/wk
GitHub
AL

Xnhyacinth/Awesome-LLM-Long-Context-Modeling

📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥

Trend 3
agent awsome-list benchmark blogs compress evaluation large-language-models length-extrapolation llm long-context-modeling long-term-memory longcot papers rag ssm survey transformer
2.0k 83 +1/wk
GitHub
LE

MLGroupJLU/LLM-eval-survey

The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".

Trend 3
benchmark evaluation large-language-models llm llms model-assessment
1.6k 99 +1/wk
GitHub
MM

MMMU-Benchmark/MMMU

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"

Trend 3
computer-vision deep-learning deep-neural-networks evaluation foundation-models large-language-models large-multimodal-models llm llms machine-learning multimodal multimodal-deep-learning multimodal-learning multimodality natural-language-processing question-answering stem visual-question-answering
553 50 +1/wk
GitHub
OP

agentscope-ai/OpenJudge

OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards

Trend 3
agent agent-skills ai-agent alignment evaluation grader llm reward reward-model rlhf skill-md skills
531 45 +3/wk
GitHub
MA

facebookresearch/meta-agents-research-environments

Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks, this platform introduces evolving environments where agents must adapt their strategies as new information becomes available, mirroring real-world challenges.

Trend 3
agents ai autonomous-agents benchmark evaluation large-language-models llm meta multi-agent-systems natural-language-processing reinforcement-learning rl simulation
469 63 +1/wk
GitHub
UA

OpenBMB/UltraEval-Audio

Your faithful, impartial partner for audio evaluation: know yourself, know your rivals.

Trend 3
evaluation speech-recognition speech-to-speech speech-to-text
286 21 +0/wk
GitHub
EV

strands-agents/evals

A comprehensive evaluation framework for AI agents and LLM applications.

Trend 3
agentic agentic-ai ai evaluation machine-learning python strands-agents
102 27 +0/wk
GitHub
LA

gil-son/language-ai-engineering-lab

Language AI Engineering Lab, a place where you can deeply understand and build modern Language AI systems, from fundamentals to production.

Trend 3
context-engineering embeedings evaluation generative-ai langchain llm nlg nlp nlu ollama openai prompt-engineering rag tokenization transformers
94 19 +1/wk
GitHub
WE

Tencent/WeKnora

LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using RAG paradigm.

Trend 3
agent agentic ai chatbot chatbots embeddings evaluation generative-ai golang knowledge-base llm multi-tenant multimodel ollama openai question-answering rag reranking semantic-search vector-search
13.8k 1.6k +12/wk
GitHub
TE

tensorzero/tensorzero

TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.

Trend 3
ai ai-engineering anthropic artificial-intelligence deep-learning genai generative-ai gpt large-language-models llama llm llmops llms machine-learning ml ml-engineering mlops openai python rust
11.2k 806 -3/wk
GitHub
OU

oumi-ai/oumi

Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!

Trend 3
dpo evaluation fine-tuning gpt-oss gpt-oss-120b gpt-oss-20b inference llama llms sft slms vlms
9.2k 744 +2/wk
GitHub
OP

open-compass/opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) across 100+ datasets.

Trend 3
benchmark chatgpt evaluation large-language-model llama2 llama3 llm openai
6.8k 755 +2/wk
GitHub
HE

Helicone/helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

Trend 3
agent-monitoring analytics evaluation gpt langchain large-language-models llama-index llm llm-cost llm-evaluation llm-observability llmops monitoring open-source openai playground prompt-engineering prompt-management ycombinator
5.5k 507 +1/wk
GitHub
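
A sketch of the "one line of code" pattern for the Helicone entry above: route OpenAI traffic through the Helicone gateway by overriding the base URL and adding an auth header. The proxy URL and header name are assumptions based on Helicone's documented gateway setup.

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at the Helicone gateway (assumed URL and header).
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)  # request and response are now logged in Helicone
```
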
CL

coze-dev/coze-loop

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

Trend 3
agent agent-evaluation agent-observability agentops ai coze eino evaluation langchain llm-observability llmops monitoring observability open-source openai playground prompt-management
5.4k 745 +4/wk
GitHub
KI

Kiln-AI/Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

Trend 3
ai chain-of-thought collaboration dataset-generation evals evaluation evaluation-framework fine-tuning machine-learning macos mcp ml ollama openai prompt prompt-engineering python rlhf synthetic-data windows
4.7k 352 -1/wk
GitHub
AU

Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

Trend 3
analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation llm-ops open-source ops optimization pipeline python qa rag rag-evaluation retrieval-augmented-generation
4.7k 389 -1/wk
GitHub
AG

Agenta-AI/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

Trend 3
agents evaluation llm-as-a-judge llm-evaluation llm-framework llm-monitoring llm-observability llm-platform llm-playground llm-tools llmops observability prompt-engineering prompt-management rag-evaluation
4.0k 506 +0/wk
GitHub
VL

open-compass/VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

Trend 3
chatgpt claude clip computer-vision evaluation gemini gpt gpt-4v gpt4 large-language-models llava llm multi-modal openai openai-api pytorch qwen vit vqa
4.0k 672 +2/wk
GitHub
LE

EvolvingLMMs-Lab/lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Trend 3
agi audio-evaluation benchmark evaluation large-language-models llm-evaluation multimodal multimodal-evaluation video-understanding vision-language-model vlm
4.0k 557 +0/wk
GitHub
LA

langwatch/langwatch

The platform for LLM evaluations and AI agent testing

Trend 3
ai analytics datasets dspy evaluation gpt llm llm-ops llmops low-code observability openai prompt-engineering
3.2k 307 -1/wk
GitHub
PR

microsoftarchive/promptbench

A unified evaluation framework for large language models

Trend 3
adversarial-attacks benchmark chatgpt evaluation large-language-models prompt prompt-engineering robustness
2.8k 219 -1/wk
GitHub
EV

huggingface/evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Trend 3
evaluation machine-learning
2.4k 313 +0/wk
GitHub
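
A minimal sketch for the 🤗 Evaluate entry above: load a built-in metric and score toy predictions.

```python
import evaluate

accuracy = evaluate.load("accuracy")  # downloads the metric script on first use
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}
```
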
LI

huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Trend 3
evaluation evaluation-framework evaluation-metrics huggingface
2.4k 449 +0/wk
GitHub
UP

uptrain-ai/uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

Trend 3
autoevaluation evaluation experimentation hallucination-detection jailbreak-detection llm-eval llm-prompting llm-test llmops machine-learning monitoring openai-evals prompt-engineering root-cause-analysis
2.3k 204 +0/wk
GitHub
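
A sketch of running preconfigured checks with the UpTrain entry above; `EvalLLM` and the `Evals` enum follow the project's documented open-source API, but treat the exact names and data keys as assumptions.

```python
from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key="sk-...")  # an LLM judge is required

# One response with its question and retrieved context (illustrative data).
data = [{
    "question": "What does UpTrain do?",
    "context": "UpTrain scores LLM responses with preconfigured checks.",
    "response": "UpTrain grades LLM outputs against built-in checks.",
}]

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_RELEVANCE],
)
print(results)  # per-check grades and explanations
```
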
EG

huggingface/evaluation-guidebook

Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!

Trend 3
evaluation evaluation-metrics guidebook large-language-models llm machine-learning tutorial
2.1k 122 +1/wk
GitHub
AB

xinshuoweng/AB3DMOT

(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"

Trend 3
2d-mot-evaluation 3d-mot 3d-multi 3d-multi-object-tracking 3d-tracking computer-vision evaluation evaluation-metrics kitti kitti-3d machine-learning multi-object-tracking real-time robotics tracking
1.8k 416 +1/wk
GitHub
WF

onestardao/WFGY

WFGY is an open-source AI Troubleshooting Atlas for RAG, agents, and real-world AI workflows. Includes the 16-problem map, Global Debug Card, and WFGY 4.0. ⭐ Star to help more builders find this repo.

Trend 3
ai-agents alignment debugging evaluation graphrag hallucination information-retrieval knowledge-graph llm rag reasoning retrieval-augmented-generation
1.7k 161 +0/wk
GitHub
KU

run-house/kubetorch

Distribute and run AI workloads on Kubernetes magically in Python, like PyTorch for ML infra.

Trend 3
artificial-intelligence aws data-processing data-science distributed evaluation gcp inference infrastructure kubernetes machine-learning observability python pytorch ray serverless training
1.2k 53 +0/wk
GitHub
TF

toshas/torch-fidelity

High-fidelity performance metrics for generative models in PyTorch

Trend 3
evaluation frechet-inception-distance gan generative-model inception-score kernel-inception-distance metrics perceptual-path-length precision pytorch reproducibility reproducible-research
1.2k 87 +1/wk
GitHub
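
For the torch-fidelity entry above, a sketch of computing FID, IS, and KID between two image folders; the directory paths are placeholders.

```python
import torch_fidelity

# Compare a folder of generated samples against a folder of real images.
metrics = torch_fidelity.calculate_metrics(
    input1="path/to/generated_images",
    input2="path/to/real_images",
    isc=True,   # Inception Score
    fid=True,   # Frechet Inception Distance
    kid=True,   # Kernel Inception Distance
)
print(metrics)
```
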
DE

always-further/deepfabric

Generate High-Quality Synthetics, Train, Measure, and Evaluate in a Single Pipeline

Trend 3
agents ai data-science dataset distillation evaluation fine-tuning huggingface huggingface-datasets machine-learning open open-source python source synthetic synthetic-data unsloth
851 80 +0/wk
GitHub
WC

angular/web-codegen-scorer

Web Codegen Scorer is a tool for evaluating the quality of web code generated by LLMs.

Trend 3
benchmarking codegen evaluation llm-coding
711 60 +0/wk
GitHub
LI

ModelTC/LightCompress

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.

Trend 3
awq benchmark deepseek-v3 deployment evaluation internlm2 large-language-models llm mixtral pruning quantization smoothquant token-merging token-pruning token-reduction tool vllm wan
698 77 +0/wk
GitHub
AL

onejune2018/Awesome-LLM-Eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of foundation LLMs, aimed at exploring the technical boundaries of generative AI.

Trend 3
awsome-list awsome-lists benchmark bert chatglm chatgpt dataset evaluation gpt3 large-language-model leaderboard llama llm llm-evaluation machine-learning nlp openai qwen rag
630 54 +1/wk
GitHub
TR

HowieHwong/TrustLLM

[ICML 2024] TrustLLM: Trustworthiness in Large Language Models

Trend 3
ai benchmark dataset evaluation large-language-models llm natural-language-processing nlp pypi-package toolkit trustworthy-ai trustworthy-machine-learning
623 67 +0/wk
GitHub
TD

jkkummerfeld/text2sql-data

A collection of datasets that pair questions with SQL queries.

Trend 3
database dataset dynet evaluation natural-language-interface natural-language-processing neural-network nlp sql
587 116 +0/wk
GitHub
AF

SAILResearch/awesome-foundation-model-leaderboards

A curated list of awesome leaderboard-oriented resources for AI domain

Trend 3
ai-agent artificial-intelligence awesome-list benchmark deep-learning evaluation foundation-model large-ai-model leaderboard machine-learning ranking-system
337 40 -1/wk
GitHub
FM

zalandoresearch/fashion-mnist

An MNIST-like fashion product database and benchmark.

Trend 0
benchmark computer-vision convolutional-neural-networks dataset deep-learning fashion fashion-mnist gan machine-learning mnist zalando
12.7k 3.1k -2/wk
GitHub
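
For the Fashion-MNIST entry above, a sketch of loading the dataset via the Keras mirror (one common route; the repo also ships its own loader).

```python
from tensorflow import keras

# 60k training and 10k test grayscale 28x28 images across 10 clothing classes.
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
```
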
CH

ianarawjo/ChainForge

An open-source visual programming environment for battle-testing prompts to LLMs.

Trend 0
ai evaluation large-language-models llmops llms prompt-engineering
3.0k 254 -2/wk
GitHub
AS

zzw922cn/Automatic_Speech_Recognition

End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow

Trend 0
audio automatic-speech-recognition chinese-speech-recognition cnn data-preprocessing deep-learning end-to-end evaluation feature-vector layer-normalization lstm paper phonemes rnn rnn-encoder-decoder speech-recognition tensorflow timit-dataset
2.8k 534 +0/wk
GitHub
EV

Cloud-CV/EvalAI

Evaluating state of the art in AI

Trend 0
ai ai-challenges angularjs artificial-intelligence challenge codecov coveralls django docker evalai evaluation leaderboard machine-learning python reproducibility reproducible-research travis-ci
2.0k 987 +0/wk
GitHub
AE

tatsu-lab/alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

Trend 0
deep-learning evaluation foundation-models instruction-following large-language-models leaderboard nlp rlhf
2.0k 306 +0/wk
GitHub
EM

SmartFlowAI/EmoLLM

Mental health LLM (LLM x Mental Health): pre- and post-training, datasets, evaluation, deployment, and RAG, with InternLM / Qwen / Baichuan / DeepSeek / Mixtral / LLaMA / GLM series models

Trend 0
dataset depoly evaluation llm post-training the-big-model-of-mental-health
1.7k 218 -2/wk
GitHub
RA

deepsense-ai/ragbits

Building blocks for rapid development of GenAI applications

Trend 0
agents document-search evaluation guardrails llms optimization prompts rag vector-stores
1.6k 135 -1/wk
GitHub
PY

sepandhaghighi/pycm

Multi-class confusion matrix library in Python

Trend 0
accuracy ai artificial-intelligence classification confusion-matrix data data-analysis data-mining data-science deep-learning deeplearning evaluation machine-learning mathematics matrix ml multiclass-classification neural-network statistical-analysis statistics
1.5k 123 +0/wk
GitHub
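
A minimal sketch for the PyCM entry above: build a multi-class confusion matrix from label vectors and read off overall and per-class statistics (the label vectors are toy data).

```python
from pycm import ConfusionMatrix

# Toy 3-class predictions versus ground truth.
cm = ConfusionMatrix(
    actual_vector=[0, 1, 2, 2, 1, 0],
    predict_vector=[0, 2, 2, 2, 1, 0],
)
print(cm.Overall_ACC)  # overall accuracy
print(cm.F1)           # per-class F1 scores as a dict
```
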
NE

Maluuba/nlg-eval

Evaluation code for various unsupervised automated metrics for Natural Language Generation.

Trend 0
bleu bleu-score cider dialog dialogue evaluation machine-translation meteor natural-language-generation natural-language-processing nlg nlp rouge rouge-l skip-thought-vectors skip-thoughts task-oriented-dialogue
1.4k 226 +0/wk
GitHub
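
A sketch for the nlg-eval entry above, scoring one hypothesis against a reference with the word-overlap metrics only; the constructor flags that disable the embedding-based metrics are assumptions based on the project's documented options.

```python
from nlgeval import NLGEval

# Skip the heavy embedding-based metrics; keep BLEU, METEOR, ROUGE-L, CIDEr.
nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)

scores = nlgeval.compute_individual_metrics(
    ref=["the cat sat on the mat"],
    hyp="a cat is sitting on the mat",
)
print(scores)
```
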
XA

EthicalML/xai

XAI - An eXplainability toolbox for machine learning

Trend 0
ai artificial-intelligence bias bias-evaluation downsampling evaluation explainability explainable-ai explainable-ml feature-importance imbalance interpretability machine-learning machine-learning-explainability ml upsampling xai xai-library
1.2k 185 +0/wk
GitHub
SK

PRBonn/semantic-kitti-api

SemanticKITTI API for visualizing dataset, processing data, and evaluating results.

Trend 0
dataset deep-learning evaluation labels large-scale-dataset machine-learning semantic-scene-completion semantic-segmentation
894 195 +0/wk
GitHub
RF

IntelLabs/RAG-FiT

Framework for enhancing LLMs for RAG tasks using fine-tuning.

Trend 0
evaluation fine-tuning information-retrieval llm nlp question-answering rag semantic-search
769 62 +0/wk
GitHub
LF

google-deepmind/long-form-factuality

Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".

Trend 0
benchmark dataset evaluation factuality language language-modeling large-language-models metrics
678 82 +0/wk
GitHub
AU

ucinlp/autoprompt

AutoPrompt: Automatic Prompt Construction for Masked Language Models.

Trend 0
evaluation language-model nlp
640 87 +0/wk
GitHub
ST

SeekingDream/Static-to-Dynamic-LLMEval

The official GitHub repository of the paper "Recent advances in large language model benchmarks against data contamination: From static to dynamic evaluation"

Trend 0
benchmark dynamic-evaluation evaluation large-language-model llm llms testing
516 40 -7/wk
GitHub
RE

sb-ai-lab/RePlay

A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models

Trend 0
algorithms collaborative-filtering deep-learning distributed-computing evaluation machine-learning matrix-factorization pyspark pytorch recommendation-algorithms recommender-system recsys transformers
394 38 +0/wk
GitHub
RE

microsoft/rag-experiment-accelerator

The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and RAG pattern.

Trend 0
acs azure chunking dense embedding evaluation experiment genai indexing information-retrieval llm openai rag sparse vectors
299 107 +0/wk
GitHub
AR

Ayanami0730/arag

A-RAG: Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces. State-of-the-art RAG framework with keyword, semantic, and chunk read tools for multi-hop QA.

Trend 0
agent agentic-ai agenticrag deepresearch evaluation graphrag llm llmagents rag reproduce
240 30 +0/wk
GitHub
NC

lechmazur/nyt-connections

Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words

Trend 0
benchmark claude evaluation gemini-pro gpt-5 grok4 llm llms-benchmarking puzzles reasoning testing
200 8 +0/wk
GitHub
EV

ai-twinkle/Eval

High-performance LLM evaluation framework with parallel API calls — up to 17× faster than sequential tools. Supports box, math, and logit-based evaluation.

Trend 0
eval evaluation llm
93 16 +0/wk
GitHub
EV

hidai25/eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

Trend 0
agent-benchmark agent-evaluation agentic-ai ai-agents anthropic autogen cli crewai evaluation langchain-agent langgraph llm mcp openai-assistants pytest python regression-testing testing
80 17 +0/wk
GitHub
SC

InternScience/SciEvalKit

A unified evaluation toolkit and leaderboard for rigorously assessing the scientific intelligence of large language and vision–language models across the full research workflow.

Trend 0
agent ai ai4science code-generation evaluation evaluation-framework gemini gpt llm llm-evaluation vllm
79 10 +0/wk
GitHub
AE

arthur-ai/arthur-engine

Make AI work for Everyone: monitoring and governance for your AI/ML

Trend 0
agentic benchmarking evaluation genai guardrails llm ml monitoring tracing
74 10 +0/wk
GitHub
EV

dustalov/evalica

Evalica, your favourite evaluation toolkit

Trend 0
New Signal
arena bradley-terry elo evalica evals evaluation hacktoberfest leaderboard library llm pagerank pairwise-comparison pyo3 python ranking rating rust serbia statistics winrate
62 5 +0/wk
GitHub

Source Breakdown

GitHub: 266.5k stars · 32.5k forks · 74 repos
PyPI: 4 packages
HuggingFace: 1 linked repo
