evaluation-framework

Here are 152 public repositories matching this topic...

EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

transformer language-model evaluation-framework

Updated Dec 18, 2024
Python

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Dec 18, 2024
TypeScript

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for AI & LLM systems

Updated Dec 17, 2024
Python

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Dec 18, 2024
Python

MaurizioFD / RecSys2019_DeepLearning_Evaluation

Repository for our RecSys 2019 article "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and for several follow-up studies.

Updated May 25, 2023
Python

huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

evaluation evaluation-metrics evaluation-framework huggingface

Updated Dec 18, 2024
Python

relari-ai / continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

information-retrieval evaluation-metrics evaluation-framework rag llmops retrieval-augmented-generation llm-evaluation

Updated Dec 18, 2024
Python

TonicAI / tonic_validate

Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

evaluation-metrics evaluation-framework rag large-language-models llm llms llmops retrieval-augmented-generation

Updated Nov 14, 2024
Python

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Dec 18, 2024
Python

Psycoy / MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation benchmarking-suite evaluation-framework benchmarking-framework foundation-models large-language-models large-language-model llm-inference llm-evaluation large-multimodal-models llm-evaluation-framework benchmark-mixture mixeval

Updated Nov 10, 2024
Python

diningphil / PyDGN

A research library for automating experiments on Deep Graph Networks

evaluation-framework deep-graph-networks deep-learning-for-graphs

Updated Sep 9, 2024
Python

zeno-ml / zeno

AI Data Management & Evaluation Platform

python data-science machine-learning ai evaluation evaluation-framework

Updated Oct 5, 2023
Svelte

aiverify-foundation / moonshot

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

benchmarking evaluation-framework red-teaming trustworthy-ai llm

Updated Dec 17, 2024
Python

lartpang / PySODEvalToolkit

PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection

Updated Sep 27, 2024
Python

bijington / expressive

Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.

xamarin parsing cross-platform evaluation netstandard expression-parser expression-evaluator hacktoberfest evaluation-framework

Updated Oct 1, 2024
C#

ServiceNow / AgentLab

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

benchmark agents evaluation-framework web-agents llm prompting llm-agents