Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks



As large language models (LLMs) continue to improve at programming, the benchmarks used to evaluate their performance are becoming steadily less useful.

That's because even though many LLMs now post similarly high scores on these benchmarks, it can be difficult to understand which ones to use for specific software development projects and enterprises.

A new paper by researchers at Yale University and Tsinghua University presents a novel way to test models' ability at "self-invoking code generation": problems that require reasoning, generating code, and reusing existing code to solve more complex problems.

Self-invoking code generation is much closer to realistic programming scenarios and provides a better measure of current LLMs' ability to solve real-world coding problems.

Self-invoking code generation

Two common benchmarks used to evaluate the coding abilities of LLMs are HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of handcrafted problems that require the model to write code for simple tasks.

However, these benchmarks cover only a subset of the challenges software developers face in the real world. In practical scenarios, software developers don't just write new code; they must also understand and reuse existing code and create reusable components to solve complex problems.

“The ability to understand and subsequently utilize generated code, i.e., self-invoking code generation, plays an important role for LLMs to leverage their reasoning capabilities for code generation in ways that existing benchmarks fail to capture,” the researchers write.

To test the ability of LLMs to generate self-invoking code, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on an example in the original dataset and introduces additional elements that require the model to solve the base problem and invoke that solution to solve a more complex problem.

Self-invoking code generation (source: arXiv)

For example, the original problem could be something simple, such as writing a function that replaces all occurrences of a given character in a string with a new character.

The extended problem would be to write a function that replaces occurrences of multiple characters in a string with their specified replacements. This would require the model to write a new function that calls the function it created for the simple problem.
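A rough sketch of what such a problem pair might look like in Python (the function names and exact task wording here are illustrative, not taken from HumanEval Pro or MBPP Pro):

```python
def replace_char(s: str, old: str, new: str) -> str:
    """Base problem: replace every occurrence of `old` in `s` with `new`."""
    return s.replace(old, new)


def replace_chars(s: str, mapping: dict[str, str]) -> str:
    """Extended, self-invoking problem: apply several single-character
    replacements by reusing the base solution."""
    # Replacements are applied sequentially, one character at a time.
    for old, new in mapping.items():
        s = replace_char(s, old, new)
    return s


print(replace_chars("banana", {"a": "o", "n": "m"}))  # prints "bomomo"
```

Solving the extended task correctly requires the model not just to produce working code, but to recognize that its earlier solution can be reused as a building block.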

“This evaluation of self-invoking code generation offers deeper insights into the programming capabilities of LLMs, extending beyond the scope of single-problem code generation,” the researchers write.

LLMs perform poorly at self-invoking code generation

The researchers tested HumanEval Pro and MBPP Pro on more than 20 open and private models, including GPT-4o, OpenAI o1-mini and Claude 3.5 Sonnet, as well as the Qwen, DeepSeek and Codestral series.

Their findings show a significant disparity between traditional coding benchmarks and self-invoking code generation tasks. “While LLMs excel at generating individual code snippets, they often struggle to effectively use their own generated code to solve more complex problems,” the researchers write.

For example, with one generation (pass@1), o1-mini achieves 96.2% on HumanEval but only 76.2% on HumanEval Pro.
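For context, pass@k is the standard code-generation metric: the estimated probability that at least one of k sampled completions passes all unit tests for a problem. A minimal sketch of the commonly used unbiased estimator is below, computed over n samples per problem of which c are correct (the numbers in the usage line are made up for illustration):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions, drawn from n samples of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# pass@1 reduces to the fraction of samples that pass the tests
print(pass_at_k(n=10, c=7, k=1))  # 0.7
```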

Another interesting finding is that while instruction tuning provides significant improvements on simple coding tasks, it shows diminishing returns on self-invoking code generation. “Current instruction-based fine-tuning approaches are not effective enough for more complex self-invoking code generation tasks, suggesting that we need to rethink how we train base models for coding and reasoning tasks,” the researchers note.

To help advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses LLMs to generate self-invoking problems based on the original problems. The models then produce candidate solutions and verify them by executing the code and running test cases on them. The pipeline reduces the need for manual code review, making it possible to create more examples with less effort.

Automatically generating self-invoking code generation problems (source: arXiv)
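A minimal sketch of how such a pipeline could be wired together is shown below. The function names, prompts, and the `llm.generate` interface are assumptions made for illustration, not the paper's actual implementation; the key idea is simply that generated examples are kept only if their candidate solutions execute and pass their own test cases:

```python
import subprocess
import tempfile
import textwrap


def run_candidate(solution_code: str, test_code: str, timeout: int = 10) -> bool:
    """Execute a candidate solution against its generated tests in a
    subprocess; the candidate is kept only if every test passes."""
    program = textwrap.dedent(solution_code) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def build_self_invoking_example(base_problem: str, llm) -> dict | None:
    """Hypothetical pipeline step: ask an LLM (any object with a `generate`
    method) to (1) write a harder problem whose solution must call the base
    solution, (2) draft a solution and tests, then keep the example only if
    the code actually runs and passes the tests."""
    new_problem = llm.generate(
        f"Extend this problem so the solution must call the original function:\n{base_problem}"
    )
    solution = llm.generate(f"Write a Python solution for:\n{new_problem}")
    tests = llm.generate(f"Write assert-based tests for:\n{new_problem}")
    if run_candidate(solution, tests):
        return {"problem": new_problem, "solution": solution, "tests": tests}
    return None  # discard examples whose candidate solution fails execution
```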

Complex landscape

This new family of benchmarks comes at a time when older coding benchmarks are quickly being conquered by frontier models. Current frontier models such as GPT-4o, o1 and Claude 3.5 Sonnet already achieve very high scores on HumanEval and MBPP, as well as their more advanced versions, HumanEval+ and MBPP+.

Meanwhile, there are more complex benchmarks such as SWE-Bench, which evaluates models' capabilities in end-to-end software engineering tasks that require a wide range of skills, such as using external libraries and files and managing DevOps tools. SWE-Bench is a very difficult benchmark, and even the most advanced models show only modest performance. For example, OpenAI o1 performs inconsistently on SWE-Bench Verified.

Self-invoking code generation sits somewhere between the simple benchmarks and SWE-Bench. It helps evaluate a very specific type of reasoning ability: using generated code in a modular way to address complex problems. Self-invoking code benchmarks can prove to be a very practical proxy for the usefulness of LLMs in real-world settings, where human programmers remain in control and are assisted by AI copilots to accomplish specific coding tasks in the software development process.

“HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related evaluations and to inspire future LLM development by shedding light on current model shortcomings and encouraging innovation in training methodologies,” the researchers write.


