From: aidotengineer

Achieving complete and spectacular failure in an AI strategy involves specific approaches to data handling and AI evaluation. These methods prioritize chaos, torpedo projects, and alienate everyone involved [00:00:57].

Embracing “Worse Practices” in AI Evaluation

Instead of following best practices, the focus should be on embracing “worse practices” to guarantee project failure [00:00:54].

Strategic Neglect of Evaluation

When measuring progress in AI, it is advised to use every generic, off-the-shelf evaluation metric available [00:11:01]. It is crucial never to bother customizing these metrics to business needs [00:11:05]. Instead, blindly trust the numbers, even if they make no sense [00:11:07]. If AI agents are not working, the solution is to pick a new framework and vendor, then fine-tune without any measurement or evaluation [00:11:15].
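For contrast, a minimal sketch of the kind of business-specific evaluation this “worse practice” forbids is shown below. The support-bot scenario, the evaluate_support_reply helper, and the rules it checks are hypothetical illustrations, not taken from the talk.

```python
# Hypothetical custom check for a customer-support bot: a reply must cite the
# refund policy and must never promise a specific dollar amount.
import re

def evaluate_support_reply(reply: str) -> dict:
    """Score one model reply against business rules, not a generic metric."""
    cites_policy = "refund policy" in reply.lower()
    promises_amount = re.search(r"\$\d+", reply) is not None  # forbidden by policy
    return {
        "cites_policy": cites_policy,
        "promises_amount": promises_amount,
        "passes": cites_policy and not promises_amount,
    }

replies = [
    "Per our refund policy, returns are accepted within 30 days.",
    "Sure, we'll send you $50 right away!",
]
results = [evaluate_support_reply(r) for r in replies]
pass_rate = sum(r["passes"] for r in results) / len(results)
print(f"pass rate: {pass_rate:.0%}")  # a number tied to an actual business requirement
```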

It is recommended to adopt the mindset that evaluations are solely a vendor’s problem and that a one-size-fits-all solution exists [00:12:09].

Dashboard of Confusion

To ensure minimal understanding, a dashboard should be created displaying every off-the-shelf metric that can be gathered [00:12:26]. The more metrics, the better, regardless of whether they track real outcomes or failure modes [00:12:35]. The numbers should be unintelligible, so that no one can tell good performance from bad [00:12:39]. The goal is to keep hoarding random metrics until one shows an upward trend, then claim success [00:12:47].

It is advised to adopt every metric from evaluation frameworks and let them guide decisions blindly, never questioning whether they measure actual success [00:13:04]. Optimizing for metrics like cosine similarity, BLEU, and ROUGE is preferred, while completely ignoring actual user experience [00:13:17]. Cross-checking with domain experts or users should be avoided; an AI’s claim of accuracy should be accepted without argument [00:13:24]. These practices contribute to challenges and trust issues with AI benchmarks.
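To see why this misleads, consider the toy illustration below (not from the talk): a simple unigram-overlap score, standing in here for BLEU/ROUGE-style surface metrics, rates a factually wrong answer nearly as highly as the correct one.

```python
# Toy stand-in for surface-overlap metrics such as BLEU or ROUGE.
def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of reference words that also appear in the candidate."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / len(ref)

reference = "the invoice is due on march 5 and totals 1200 dollars"
right_answer = "the invoice totals 1200 dollars and is due on march 5"
wrong_answer = "the invoice totals 1200 dollars and is due on march 15"  # wrong date

print(unigram_overlap(right_answer, reference))  # 1.0
print(unigram_overlap(wrong_answer, reference))  # ~0.91, despite misleading the user
```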

Avoiding Data at All Costs

A potent technique to achieve dysfunction is to actively avoid looking at data [00:13:42].

Blind Trust in AI Output

It is asserted that one can absolutely trust an AI’s output without ever looking at it oneself [00:13:58]. Looking at data is deemed an “engineering problem,” beneath the concern of leaders who have more important strategic tasks, like attending meetings about meetings [00:14:03]. Developers are assumed to have more domain expertise than business teams [00:14:18].

Customer as QA

Customers are considered the best Quality Assurance (QA) [00:14:23]. It is important to trust one’s gut feelings, as they are a reliable substitute for data, especially when making million-dollar decisions [00:14:34].

Data Inaccessibility

Engineers are viewed as coding wizards who will handle everything, regardless of their lack of customer interaction [00:14:54]. Simpler data annotation options like spreadsheets should be quickly forgotten [00:15:05].
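As a point of reference, the simple spreadsheet option being forgotten here can be as small as the sketch below: dumping recent traces to a CSV that domain experts can open and annotate. The field names and example traces are hypothetical.

```python
# Hypothetical traces exported for review; domain experts fill in the
# expert_label and notes columns in any spreadsheet tool.
import csv

traces = [
    {"question": "Can I get a refund after 45 days?",
     "model_answer": "Yes, refunds are available at any time.",
     "expert_label": "", "notes": ""},
    {"question": "What is the return window?",
     "model_answer": "Returns are accepted within 30 days.",
     "expert_label": "", "notes": ""},
]

with open("traces_for_review.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["question", "model_answer", "expert_label", "notes"]
    )
    writer.writeheader()
    writer.writerows(traces)
```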

It is crucial to ensure that no one else is looking at the data [00:15:20]. The best way to achieve this is to store it in complex systems accessible only to engineers, making it unavailable to domain experts [00:15:24]. Executives should insist on purchasing custom data analysis platforms that require a team of PhDs to operate and understand [00:15:37]. Bonus points are awarded if the platform takes six months to load the data and produces incessant errors [00:15:47]. These actions contribute to the mismanagement of AI resources.