
Benchmark on SWE-Bench #415

Open
distbit0 opened this issue Apr 10, 2024 · 1 comment · May be fixed by #670

distbit0 commented Apr 10, 2024

It would be interesting to see the performance on SWE-Bench benchmarks, so that this project can be more clearly differentiated from the increasing number of other coding agents.

From the abstract of the SWE-bench paper (arXiv:2310.06770):

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
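For concreteness, here is a minimal sketch of what pulling a SWE-bench task instance looks like. It assumes the `datasets` library is installed and that the benchmark is published on the Hugging Face Hub as `princeton-nlp/SWE-bench`; the field names shown (`instance_id`, `repo`, `base_commit`, `problem_statement`) reflect the published dataset schema at the time of writing, and the sample `instance_id` is illustrative only.

```python
# Minimal sketch: load SWE-bench and inspect one task instance.
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

example = swebench[0]
# Each instance pairs a real GitHub issue with the repository state it was
# filed against; the agent must produce a patch that makes the hidden tests pass.
print(example["instance_id"])              # e.g. "astropy__astropy-12907"
print(example["repo"])                     # source repository, e.g. "astropy/astropy"
print(example["base_commit"])              # commit to check the repository out at
print(example["problem_statement"][:500])  # the issue text given to the agent
```

The smaller `princeton-nlp/SWE-bench_Lite` subset is commonly used for quicker evaluation runs.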

distbit0 changed the title from "Run Benchmark on SWE-Bench" to "Benchmark on SWE-Bench" on Apr 10, 2024
TomLucidor commented

Seconding this, but I'm not sure how it could be done. Also, what other benchmarks are worth testing?

erkinalp added a commit to erkinalp/devika that referenced this issue Dec 18, 2024
- Add Docker-based evaluation harness
- Implement comprehensive test coverage
- Add SWE-bench dependencies
- Support batch evaluation with proper error handling

Fixes stitionai#415

Co-Authored-By: Erkin Alp Güney <[email protected]>
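The commit above mentions batch evaluation with proper error handling. Below is a minimal, hypothetical sketch of what such a loop could look like; `run_agent_on_instance` is a placeholder for whatever entry point the harness exposes (not an actual Devika API), and it is assumed to return a unified-diff patch string for a given SWE-bench instance.

```python
# Hypothetical batch-evaluation loop with per-instance error handling.
import json
import logging
import traceback

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("swebench-batch")

def evaluate_batch(instances, run_agent_on_instance, out_path="predictions.jsonl"):
    """Run the agent on each instance, recording patches and failures."""
    results = []
    for instance in instances:
        instance_id = instance["instance_id"]
        try:
            patch = run_agent_on_instance(instance)
            results.append({"instance_id": instance_id,
                            "model_patch": patch,
                            "status": "ok"})
        except Exception:
            # One bad instance should not abort the whole batch run.
            logger.error("Failed on %s:\n%s", instance_id, traceback.format_exc())
            results.append({"instance_id": instance_id,
                            "model_patch": "",
                            "status": "error"})
    # Write predictions as JSONL; the SWE-bench evaluation harness scores
    # predictions keyed by instance_id and model_patch.
    with open(out_path, "w") as f:
        for row in results:
            f.write(json.dumps(row) + "\n")
    return results
```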
erkinalp linked a pull request (#670) on Dec 18, 2024 that will close this issue