
Apr 14, 2026

The Vendor Selection Problem: Choosing AI Tools for CRE Workflows

Every CRE firm evaluating AI tools has the same problem. The market is crowded. The demos all look impressive. The vendor decks all describe the same capabilities. The reference customers all report enthusiastic adoption. The pilot results are inconsistent across firms. The selection process produces decisions that the team second-guesses within a year.

The problem is that most evaluation frameworks were built for traditional software. They emphasize features, integrations, and price. AI tools require a different framework because the value depends on workflow fit, output quality, and the firm's verification discipline. A tool with weaker features and stronger workflow fit will outperform a tool with stronger features and weaker fit, every time.

The vendor selection problem is a workflow design problem disguised as a procurement problem. Treating it as procurement produces bad decisions. Treating it as workflow design produces decisions that hold up.

What Most Evaluation Frameworks Miss

A traditional software evaluation rates vendors on features, integrations, security, support, and price. Each dimension is scored. The vendor with the highest aggregate score wins. The framework is rational and structured, and it produces the wrong answer for AI tools.

The reason is that AI value is not a feature checklist. The value comes from the alignment between the tool's capabilities and the firm's specific workflow, including the documents the firm processes, the verification standards the firm enforces, and the downstream systems the firm uses. Two firms evaluating the same tool will get different value because their workflows are different.

Traditional Criterion | What It Measures | What It Misses
Feature completeness | Breadth of capabilities | Depth in the specific workflow
Integration count | Number of connectors | Quality of relevant integrations
Security certifications | Compliance posture | Audit trail and provenance
Reference customers | Vendor's customer base | Workflow similarity to your firm
Price | Cost | Cost-adjusted output quality

The criteria that matter most for AI tools are largely absent from traditional frameworks. Output quality on the firm's actual documents. Audit trail completeness. Workflow integration depth. Verification overhead. Failure mode visibility. Each requires a different evaluation method than feature scoring.

The Criteria That Actually Matter

A framework that produces good AI tool decisions emphasizes five dimensions.

Output quality on the firm's documents. The vendor's claimed accuracy on benchmark documents is not predictive of accuracy on the firm's documents. The firm has to test the tool on its own document set, including the messy and edge-case documents that the workflow actually encounters. A tool that handles clean documents well and messy documents poorly will produce inconsistent value.

Audit trail and provenance. Every output must trace to a source. The tool either provides citations as part of its data model or it does not. A tool without audit trail capability will produce values the team cannot defend, regardless of how accurate those values are.

Workflow integration depth. The tool has to fit into the firm's existing workflow without requiring the workflow to be rebuilt around the tool. The integration has to handle the inputs the workflow consumes, produce the outputs the workflow expects, and surface exceptions in a way the team can act on.

Verification overhead. The tool's output requires human verification, and the verification overhead determines whether the tool produces net efficiency. A tool whose output takes more time to verify than it saves relative to manual production is a net cost, regardless of how impressive the demo is.

Failure mode visibility. When the tool produces a wrong answer, how does the team know? A tool that fails silently is dangerous. A tool that surfaces low confidence, flags conflicts, and routes exceptions for review is safe. The visibility of failure modes determines whether the tool can be trusted at scale.

Dimension | What to Test
Output quality | Accuracy on the firm's actual documents, including edge cases
Audit trail | Citation completeness, source page references, original language
Workflow integration | Input handling, output format, exception surfacing
Verification overhead | Time required to verify a typical output
Failure mode visibility | Confidence scores, conflict flags, escalation routing

How to Test Each Dimension

A vendor pilot should be designed to test these dimensions, not to confirm the demo. The pilot is the only stage where the firm can collect evidence about the actual fit.

The output quality test requires running the tool on a representative sample of the firm's documents. The sample must include the easy cases and the hard cases. The team measures accuracy by comparing the tool's output to a verified ground truth, ideally produced manually by a senior team member. The accuracy on hard cases is what predicts the tool's value at scale, not the average accuracy across all cases.
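
One way to keep this measurement honest is to score accuracy separately for easy and hard cases rather than reporting a single blended number. The sketch below is illustrative only; the field names, the result format, and the easy/hard labels are assumptions, not part of any particular tool's output.

```python
# Illustrative sketch: score pilot extraction accuracy separately for
# easy and hard documents. Data shapes and field names are hypothetical.
from collections import defaultdict

def accuracy_by_difficulty(results):
    """results: list of dicts like
    {"doc_id": "...", "difficulty": "easy" or "hard",
     "extracted": {...}, "ground_truth": {...}}
    Ground truth is produced manually by a senior reviewer."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        for field, truth in r["ground_truth"].items():
            total[r["difficulty"]] += 1
            if r["extracted"].get(field) == truth:
                correct[r["difficulty"]] += 1
    return {d: correct[d] / total[d] for d in total if total[d]}

# A tool can look strong on the blended average while failing on exactly
# the documents that predict its value at scale.
sample = [
    {"doc_id": "lease-01", "difficulty": "easy",
     "extracted": {"base_rent": 42.50, "term_months": 120},
     "ground_truth": {"base_rent": 42.50, "term_months": 120}},
    {"doc_id": "lease-07", "difficulty": "hard",
     "extracted": {"base_rent": 38.00, "term_months": 64},
     "ground_truth": {"base_rent": 38.00, "term_months": 84}},
]
print(accuracy_by_difficulty(sample))  # e.g. {'easy': 1.0, 'hard': 0.5}
```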

The audit trail test requires inspecting the data model. Where does the output live? What metadata accompanies each value? Can a reviewer click from a value to its source page? Can the tool produce a verification log? Tools that lack any of these capabilities should be eliminated regardless of their other strengths.
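
As a concrete reference point, the kind of record the audit trail test is looking for might resemble the sketch below. The field names are hypothetical, not any vendor's actual schema; the test is whether the data model carries equivalent provenance for every value.

```python
# Illustrative sketch of a citation-complete record. The exact fields are
# assumptions; what matters is that every value can be traced to a source.
from dataclasses import dataclass

@dataclass
class ExtractedValue:
    field: str            # e.g. "base_rent_psf"
    value: str            # the value as presented to reviewers
    source_document: str  # which document the value came from
    source_page: int      # page a reviewer can click through to
    source_text: str      # the original language supporting the value
    confidence: float     # tool-reported confidence, 0.0 to 1.0
    verified_by: str | None = None  # reviewer who confirmed it, if any

rent = ExtractedValue(
    field="base_rent_psf",
    value="42.50",
    source_document="suite-300-lease.pdf",
    source_page=14,
    source_text="Base Rent shall be $42.50 per rentable square foot...",
    confidence=0.93,
)
# If the vendor's model cannot populate fields like these, the output
# cannot be defended at IC review or lender diligence.
```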

The workflow integration test requires running the tool through a complete deal cycle, not a single document. The tool either fits into the cycle or it does not. The team identifies the points where the tool requires the workflow to change and assesses whether the change is acceptable.

The verification overhead test requires timing the verification work. The team measures how long it takes a reviewer to verify the tool's output for a representative document. The measurement compares to the time required to produce the same output manually. The comparison reveals whether the tool produces net efficiency or net cost.
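
A simple way to make the comparison concrete is to compute the net time effect per document and project it across the pilot volume. The numbers below are placeholders; real values come from timing the firm's own reviewers.

```python
# Illustrative sketch: net efficiency per document. All figures are
# placeholder measurements, not benchmarks.
def net_minutes_saved(manual_minutes, tool_minutes, verify_minutes):
    """Positive means the tool saves time; negative means it is a net cost."""
    return manual_minutes - (tool_minutes + verify_minutes)

# Hypothetical pilot measurements for a typical lease abstract:
manual = 90        # minutes for an analyst to abstract the lease by hand
tool_run = 5       # minutes to run the tool and export its output
verification = 35  # minutes for a reviewer to verify the tool's output

per_doc = net_minutes_saved(manual, tool_run, verification)
print(f"Net saving per document: {per_doc} minutes")          # 50 minutes
print(f"Across a 200-lease portfolio: {per_doc * 200 / 60:.0f} hours")
```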

The failure mode test requires deliberately running the tool on documents that are likely to produce errors: damaged scans, unusual lease structures, documents with handwritten amendments. The team observes how the tool reports its uncertainty. A tool that reports high confidence on documents where it is wrong exhibits the most dangerous failure mode.
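
One way to make this test measurable is to tabulate the tool's reported confidence against whether each value was actually correct, and pull out the high-confidence errors. The input shape and the threshold below are assumptions for illustration, not a vendor-specific format.

```python
# Illustrative sketch: surface high-confidence errors from a stress-test run.
# The result shape and the 0.9 threshold are assumptions for illustration.
def high_confidence_errors(results, threshold=0.9):
    """results: list of dicts like
    {"doc_id": ..., "field": ..., "confidence": float, "correct": bool}"""
    return [r for r in results
            if r["confidence"] >= threshold and not r["correct"]]

stress_run = [
    {"doc_id": "scan-damaged-02", "field": "term_months",
     "confidence": 0.95, "correct": False},   # confidently wrong: dangerous
    {"doc_id": "handwritten-amend-01", "field": "base_rent",
     "confidence": 0.41, "correct": False},   # wrong, but flagged as uncertain
]
for r in high_confidence_errors(stress_run):
    print(f"{r['doc_id']}: {r['field']} reported at "
          f"{r['confidence']:.0%} confidence but was wrong")
```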

Vendor Patterns That Predict Failure

Three vendor patterns consistently predict deployment failure.

Demo-only accuracy claims. The vendor's accuracy figures come from controlled demonstrations on curated documents. The vendor declines to test on the firm's own document set, or insists on a curated subset. This pattern indicates that the tool's accuracy degrades on real documents and that the vendor knows it.

Black box outputs. The tool produces values without surfacing source citations, confidence scores, or conflict flags. The vendor positions this as simplicity. The firm should position it as a defect. A black box output cannot be defended at IC review or lender diligence.

Workflow rebuild requirements. The tool requires the firm to rebuild its workflow to accommodate the tool's data model. The vendor positions this as best practice. The firm should position it as integration risk. A tool that cannot fit into the existing workflow will not survive the team's resistance to changing the workflow.

Pattern | Why It Predicts Failure
Demo-only accuracy | Real-document performance is unknown or worse
Black box outputs | Outputs cannot be verified or defended
Workflow rebuild | Adoption requires sustained change management

A vendor exhibiting any of these patterns should be approached with elevated skepticism, regardless of how strong the rest of the pitch appears.

What "Done" Looks Like for Vendor Selection

A defensible AI vendor selection meets the following criteria:

  • The pilot tested the tool on the firm's actual documents, including hard cases.

  • The audit trail capability has been inspected and verified at the data model level.

  • The workflow integration has been validated through a complete deal cycle, not a single document.

  • The verification overhead has been measured and the tool produces net efficiency in the workflow.

  • The failure modes have been deliberately tested and the tool surfaces uncertainty appropriately.

If any of these criteria are unmet, the selection is not defensible.

Conclusion

Vendor selection for AI tools is not a procurement exercise. It is a workflow design exercise that uses procurement methods. Firms that select on features, demos, and reference calls will continue to acquire tools that underperform in deployment. Firms that select on output quality, audit trail, integration depth, verification overhead, and failure mode visibility will acquire tools that compound value across deals. The vendor question is not which tool has the most capabilities. The question is which tool produces verifiable output in the firm's actual workflow at acceptable verification cost. The teams that ask the right question get the right answer.

Request a Free Trial

See how Eagle Eye brings clarity, accuracy, and trust to deal documents.
