Beyond ROI: Why AI Needs the Equitable Evaluation Framework


In the rush to “solve” poverty or “optimize” social services through Artificial Intelligence, the nonprofit sector has found itself at a crossroads. On one path is the Technical Evidence approach: a world of Randomized Controlled Trials (RCTs), p-values, and “objective” efficiency. On the other is Equitable Evaluation: an approach championed by the Equitable Evaluation Institute that views data not as a neutral tool, but as an instrument of power that can either liberate or oppress.

As AI use scales within development and social sector contexts, the choice of which path to take is no longer just a methodological preference—it is a moral imperative.

The Clash of Two Worldviews

The Technical Evidence approach is undeniably useful for preventing wasted resources. However, it often operates under the “Orthodoxies” of traditional research: that the researcher is the sole expert and that “objectivity” is the highest goal. In doing so, it treats marginalized communities as subjects to be studied rather than partners in progress.

The following table highlights the fundamental shift required to move from a purely technical lens to an equitable one:

| Feature | Technical Evidence / Highest ROI Approach | Equitable Evaluation Framework (EEF) Approach |
| --- | --- | --- |
| Primary Audience | Funders, Academics, and Board Members | The Community and Impacted Stakeholders |
| Evaluator’s Role | Neutral, outside “objective” expert | Partner, facilitator, and advocate |
| Definition of Rigor | Statistical significance and randomized controlled trial design | Cultural rigor and lived experience |
| Goal of AI | Efficiency, scalability, and “ROI” | Shifting power and restoring justice |


When “Technical ROI Success” is a “Social Failure”

In the nonprofit and development sectors, a program can be a statistical triumph while remaining a tool for systemic oppression. Without the EEF’s requirement to account for Historical and Contemporary Context, we risk automating the status quo. Consider these three examples of the “ROI vs. Equity” trade-off:

1. The Growth vs. Need Paradox (J-PAL’s Cash Transfer Trade-offs)

The AI Evidence Playbook from the Abdul Latif Jameel Poverty Action Lab (J-PAL) highlights research on cash transfers that provides a classic example of this tension. In certain pilot programs, evaluations found that unrestricted cash grants targeted at those just below the poverty line generated the highest overall economic growth per family. From a technical “Return on Investment” (ROI) standpoint, this was a massive success.

  • The Equity Failure: This strategy effectively de-prioritizes the “ultra-poor”—those in the deepest need who require more intensive support to see similar gains. By optimizing for the highest aggregate growth (Efficiency and ROI), the program systematically excluded the most marginalized (Equity). A technical win that inadvertently widens the gap within the community is not a success by equitable standards.
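
To make the trade-off concrete, here is a minimal sketch with invented households, numbers, and budget (illustrative only, not J-PAL’s actual data): the same budget, allocated once for maximum aggregate gain and once for depth of need, reaches entirely different families.

```python
# A hypothetical sketch of the ROI-vs-need trade-off. All households,
# gains, and the budget are invented for illustration.

households = [
    # (name, depth_of_poverty, expected_gain_per_grant_dollar)
    ("ultra_poor_A", 0.90, 0.40),   # deepest need, smaller short-run gain
    ("ultra_poor_B", 0.85, 0.45),
    ("near_line_C",  0.20, 1.10),   # just below the poverty line, big gain
    ("near_line_D",  0.15, 1.05),
]

BUDGET = 2  # number of grants we can afford

# "Technical ROI" allocator: maximize aggregate expected gain.
by_roi = sorted(households, key=lambda h: h[2], reverse=True)[:BUDGET]

# Need-first allocator: prioritize depth of poverty.
by_need = sorted(households, key=lambda h: h[1], reverse=True)[:BUDGET]

print("ROI-optimal grants:", [h[0] for h in by_roi])    # near_line_C, near_line_D
print("Need-first grants: ", [h[0] for h in by_need])   # ultra_poor_A, ultra_poor_B
```

Both allocators are “correct” against their own objective; the equity question is which objective the evaluation was ever asked to optimize.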

2. The Dutch “SyRI” Fraud Detection Scandal

The Dutch government deployed an AI system, SyRI (Systeem Risico Indicatie), to identify people at risk of committing benefits fraud. Technically, the system was a success; it processed vast amounts of data to flag “high-risk” individuals more efficiently than human auditors ever could.

  • The Equity Failure: The evaluation of the tool’s “success” failed to account for context. The algorithm disproportionately flagged families in low-income neighborhoods and those with dual nationalities. While it was “technically accurate” according to its programmed parameters, it became a tool for state-sponsored discrimination. It took a landmark court ruling to stop the tool because the technical evaluation never asked: “Whom is this harming?”
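
A toy sketch of this failure mode, using invented records rather than anything from SyRI itself, shows how a flagger can look accurate in aggregate while its false positives concentrate on one group:

```python
# Hypothetical records: (group, flagged_by_model, actually_fraud).
# All values are invented; this is not the SyRI model or its data.
records = [
    ("low_income",  True,  False), ("low_income",  True,  False),
    ("low_income",  True,  True),  ("low_income",  False, False),
    ("high_income", False, False), ("high_income", False, False),
    ("high_income", True,  True),  ("high_income", False, False),
]

def false_positive_rate(group):
    # Share of innocent people in the group whom the model flagged anyway.
    innocents = [r for r in records if r[0] == group and not r[2]]
    return sum(r[1] for r in innocents) / len(innocents)

accuracy = sum(flag == fraud for _, flag, fraud in records) / len(records)
print(f"overall accuracy: {accuracy:.0%}")                            # 75%
print(f"FPR low_income:  {false_positive_rate('low_income'):.0%}")    # 67%
print(f"FPR high_income: {false_positive_rate('high_income'):.0%}")   # 0%
```

The aggregate number looks defensible; only the disaggregated view reveals whom the tool is harming.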

3. The “Optimization” of Medical Care (The Optum Case)

A widely used AI algorithm was designed to identify which patients needed “high-risk care management.” Technically, it performed brilliantly at its stated goal: predicting which patients would cost the most to treat in the future.

  • The Equity Failure: Because the algorithm used “past healthcare spending” as a proxy for “health need,” it failed to account for the historical reality that Black patients often have less access to healthcare and thus lower historical spending. As a result, the AI consistently ranked healthier White patients as “higher risk” than sicker Black patients. The tool was technically “objective,” but because the evaluation ignored the Historical Context of medical racism, it automated a life-threatening bias.
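
Here is a minimal sketch of the proxy problem, with invented patients and figures (not the actual Optum algorithm): ranking by past spending quietly demotes a high-need patient whose spending history reflects limited access to care, not better health.

```python
# Hypothetical patients: (name, true_health_need, past_spending).
# Spending reflects *access* as well as need; all figures are invented.
patients = [
    ("patient_A", 8, 9_000),   # high need, good historical access
    ("patient_B", 8, 3_000),   # same need, less access -> lower spending
    ("patient_C", 4, 7_000),   # moderate need, high utilization
]

# Proxy ranking: "risk" approximated by past spending, as a cost-prediction
# model would learn to do.
by_spending = sorted(patients, key=lambda p: p[2], reverse=True)

# What an equitable evaluation would check the ranking against: actual need.
by_need = sorted(patients, key=lambda p: p[1], reverse=True)

print("Enrolled by cost proxy:", [p[0] for p in by_spending])  # A, C, B
print("Enrolled by need:      ", [p[0] for p in by_need])      # A, B, C
```

Nothing in the code is “biased” in the colloquial sense; the harm enters through the choice of proxy, which is exactly what a context-blind technical evaluation never audits.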

Nonprofits’ and Philanthropy’s Moral Mandate

Nonprofits exist to fill the gaps created by market and state failures. If we adopt AI using the same “efficiency-first” metrics as the private sector or even government actors, we betray our missions. Equitable Evaluation offers three non-negotiable shifts for our sector:

  1. From Efficiency to Service: We must stop evaluating AI to satisfy “ROI” for funders and start evaluating it to serve the community’s self-determination. 
  2. From External Experts to Community Ownership: The “validity” of an AI tool should be determined by those it affects, not just those who coded it.
  3. From Neutrality to Advocacy: We must accept that evaluation is a political act. If our AI evaluation isn’t seeking to dismantle inequality, it is maintaining it, whether it uses a randomized controlled trial or not.

Conclusion: A New Standard for Rigor

“Rigor” has long been a gatekeeping term used to prioritize quantitative data over lived experience. It is time we redefine it. True rigor in the age of AI is Cultural Rigor. It is the ability to see the ghost of past discrimination in a dataset and the courage to stop a “scalable” tool if it compromises the dignity of a single community.

For philanthropy and the nonprofit sector, the goal of AI shouldn’t just be to do things faster, better, or more efficiently; it must be to do things that dismantle systemic oppression with intention.
