LatticeFlow's LLM framework makes an initial attempt at evaluating Big AI's adherence to the EU AI Act regulations.

Meanwhile, the European Union is out ahead of most countries' lawmakers: it passed a risk-based framework for regulating AI applications this year.

The law came into effect in August, though full details of the pan-EU AI governance regime are still being worked out: Codes of Practice are being drafted, for instance. Still, over the coming months and years the law's tiered provisions will start to apply to AI app and model makers, so the compliance countdown is already live and ticking.

The next challenge is working out whether, and how, AI models are meeting their legal obligations. LLMs and other so-called foundation or general-purpose AIs will underpin most AI applications, so focusing assessment efforts on this layer of the AI stack looks important.

Step up LatticeFlow AI, a spinout from public research university ETH Zurich that focuses on AI risk management and compliance.

On Wednesday, it published what it's touting as the first technical interpretation of the EU AI Act, meaning it's sought to map regulatory requirements to technical ones, alongside an open-source LLM validation framework that draws on this work — which it's calling Compl-AI ('compl-ai'… see what they did there!).
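
For a flavor of what that kind of mapping could look like, here's a minimal, purely illustrative sketch in Python. The groupings below are an assumption on our part, not Compl-AI's actual schema, though the benchmark names are drawn from the categories the project uses.

```python
# Illustrative sketch only: a hypothetical mapping from EU AI Act principles
# to technical benchmarks, in the spirit of what Compl-AI describes.
# The principle names and groupings are assumptions, not the project's schema.
REQUIREMENT_TO_BENCHMARKS = {
    "robustness_and_safety": ["following_harmful_instructions", "toxic_completions_of_benign_text"],
    "non_discrimination_and_fairness": ["prejudiced_answers", "recommendation_consistency"],
    "transparency_and_accuracy": ["truthfulness", "common_sense_reasoning"],
}

def benchmarks_for(requirement: str) -> list[str]:
    """Return the technical benchmarks mapped to a regulatory requirement."""
    return REQUIREMENT_TO_BENCHMARKS.get(requirement, [])

print(benchmarks_for("non_discrimination_and_fairness"))
```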

According to LatticeFlow, the AI model evaluation initiative — which they also call "the first regulation-oriented LLM benchmarking suite" — is the fruit of long-term cooperation between the Swiss Federal Institute of Technology and Bulgaria's Institute for Computer Science, Artificial Intelligence and Technology (INSAIT).

AI model developers can apply on the Compl-AI website for the evaluation of their technology regarding compliance with the requirements set by the EU AI Act.

It has also published model evaluations for several mainstream LLMs, including various versions and sizes of Meta's Llama models and OpenAI's GPT models, as well as an EU AI Act compliance leaderboard for Big AI. The latter ranks the performance of models from outfits such as Anthropic, Google, OpenAI, Meta and Mistral against the requirements of the law, on a scale of 0 (no compliance) through to 1 (full compliance).

Evaluations are otherwise marked N/A where information is unavailable or where the model maker hasn't implemented the capability. (NB: At the time of writing there were also some minus scores recorded, but we're told that was down to a bug in the Hugging Face interface.)

Models are rated on their responses across 27 benchmarks, such as "toxic completions of benign text", "prejudiced answers", "following harmful instructions", "truthfulness", and "common sense reasoning", to name a few of the categories used for the evaluations. So every model gets a range of scores in each column (or else N/A).
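
To make the scoring scheme concrete, here's a minimal sketch of how a model's row of results might be sanity-checked under the conventions described above; it assumes N/A entries are represented as None and flags anything outside the 0-to-1 range (such as the buggy negative scores). It is not the leaderboard's actual code.

```python
from typing import Optional

# A minimal sketch (not the project's code) of reading one model's row of
# per-benchmark results: scores are expected on the 0-to-1 scale, N/A entries
# are represented here as None, and anything outside the range (such as a
# negative value caused by an interface bug) is flagged rather than averaged in.
def partition_scores(row: dict[str, Optional[float]]):
    scored, missing, suspicious = {}, [], {}
    for benchmark, value in row.items():
        if value is None:
            missing.append(benchmark)           # reported as N/A
        elif 0.0 <= value <= 1.0:
            scored[benchmark] = value           # valid compliance score
        else:
            suspicious[benchmark] = value       # out of range, e.g. a negative score
    return scored, missing, suspicious

row = {
    "following_harmful_instructions": 0.97,
    "recommendation_consistency": 0.42,
    "watermark_reliability": None,
    "training_data_suitability": -0.1,          # hypothetical buggy value
}
print(partition_scores(row))
```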

AI compliance a mixed bag
So, how did the major LLMs do? There is no overall model score, so performance varies depending on exactly what's being evaluated, but there are some notable highs and lows across the various benchmarks.

For example, all the models performed strongly at not following harmful instructions, and relatively strongly across the board at not producing prejudiced answers, but scores on reasoning and general knowledge were much more of a mixed bag.

Elsewhere, recommendation consistency, which the framework uses as a measure of fairness, was very weak for all models—none of them scored above the halfway mark (and most well below that).
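
To give a rough sense of what a recommendation-consistency check might involve in general terms (this is a guess at the idea, not the framework's actual benchmark), the sketch below compares recommendations across prompts that differ only in a demographic attribute.

```python
# A rough, hypothetical illustration of a recommendation-consistency check
# (not the framework's actual benchmark): ask for recommendations in prompts
# that differ only in a demographic attribute and measure answer overlap.
def consistency(generate, template: str, attributes: list[str]) -> float:
    """Jaccard overlap of recommendation sets across demographic variants."""
    rec_sets = []
    for attr in attributes:
        answer = generate(template.format(attribute=attr))
        rec_sets.append(set(item.strip().lower() for item in answer.split(",")))
    common = set.intersection(*rec_sets)
    union = set.union(*rec_sets)
    return len(common) / len(union) if union else 1.0

# Usage with a dummy model that ignores the attribute entirely (score 1.0):
dummy = lambda prompt: "career counseling, budgeting app, online course"
template = "Recommend three services for a {attribute} job seeker:"
print(consistency(dummy, template, ["young", "older"]))  # -> 1.0
```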

Other areas, such as training data suitability and watermark reliability and robustness, appear to be essentially unevaluated, given how many results are marked N/A.

LatticeFlow does concede that there are some domains where models' compliance is harder to assess, including hot-button issues such as copyright and privacy. So it's not touting a panacea.

In a report detailing work done on the framework, the scientists involved in the effort note that most of the smaller models they tested ("≤ 13B parameters") "scored poorly on technical robustness and safety."

They also found that "almost all examined models struggle to achieve high levels of diversity, non-discrimination, and fairness."

But they argue this is just a starting point, writing: "We believe that these weaknesses are primarily driven by model providers displacing their focus on enhancing model capabilities over other critical factors described under the EU AI Act's regulatory requirements." With compliance deadlines approaching, LLM makers will have to shift their focus to the identified areas of concern, "which will result in a more balanced development of LLMs".

And since no one yet knows exactly what will be required to comply with the EU AI Act, LatticeFlow's framework is necessarily a work in progress. It is also just one interpretation of how the law's requirements might translate into technical outputs that can be benchmarked and compared. But it's an interesting start to what will need to be an ongoing effort to probe powerful automation technologies and try to steer their developers toward safer utility.

"The framework is one step towards full compliance-centered evaluation of the EU AI Act — but is created in a way to be easy to update to move in lock-step as the Act gets updated and the various working groups make progress," LatticeFlow CEO Petar Tsankov said in a statement to TechCrunch. "The EU Commission supports this. We expect the community and industry to continue to develop the framework toward full and comprehensive AI Act assessment platform."

Summarizing the main takeaways so far, Tsankov said it's clear AI models have "predominantly been optimized for capabilities rather than compliance". He also flagged "notable performance gaps", observing that some of the higher-capability models can be on a par with weaker models when it comes to compliance.

Cyberattack resilience at the model level and fairness are of particular concern, said Tsankov, with many models scoring below 50 percent in the former area.

"While Anthropic and OpenAI managed to calibrate their (closed) models to score against jailbreaks and prompt injections, open-source vendors such as Mistral kept it less tight on this," he said.

And with "most models" performing equally abysmally on fairness benchmarks he suggested that this should be a high priority for future work.

Even in the case of copyright and privacy, benchmarking has its limits, according to Tsankov: "one thing about benchmarking the performance of the LLM is that currently the benchmarks would check only for copyright books. This has two major limitations: (i) it does not tell one about possible copyright violation in materials other than those particular books; and (ii) it is based on attempts to quantify the size of model memorization, which is notoriously hard."

"For privacy the challenge is the same: the benchmark only tries to figure out if the model has memorized specific sensitive personal information."

LatticeFlow hopes that the free and open source framework will be adopted and improved by the broader AI research community.

"We invite AI researchers, developers, and regulators to collaborate with us on this open project," said ETH Zurich professor Martin Vechev, also founder and scientific director of INSAIT. "We invite further research groups and practitioners to fine-tune the mapping for the AI Act, develop more benchmarks, and generalize this open-source framework."

The approach could also be used to test AI models against other evolving regulations beyond the EU AI Act, making it a handy tool for organizations operating across multiple jurisdictions.

 
