Data cleaning is a standard part of every quantitative research project. Responses are reviewed, quality flags are applied, low-quality completes are removed, and the final dataset is delivered to the client with a methodology note describing the exclusions.
This is a rational process. It is also, in important ways, the wrong solution to the right problem.
The Structural Flaw in Post-Fieldwork QC
Post-fieldwork data cleaning assumes that fraudulent responses can be identified and removed without affecting the integrity of the remaining data. This assumption is partially true for simple fraud — speeders and straightliners removed from a large dataset leave the remainder largely intact.
It is not true for structural fraud. When a significant proportion of the respondents in a quota cell are fraudulent (VPN users pretending to be in the right geography, profile fraudsters claiming to be the right demographic), removing them does not restore the data. It leaves those cells with too few genuine completes to analyze.
The choices at that point are: re-field those cells at additional cost and time, report findings on an undersized sample with caveats that undermine the value of the research, or present findings that are structurally compromised. None of these options is acceptable. All of them are currently happening.
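To make the shortfall concrete, here is a minimal Python sketch that counts the usable completes left in each quota cell once flagged responses are excluded. The cell names, the counts, the `flagged` field, and the minimum cell size of 50 are all illustrative assumptions, not figures from any real study.

```python
from collections import Counter

# Hypothetical minimum number of completes needed to report on a cell.
MIN_CELL_SIZE = 50

def cell_shortfalls(responses, min_cell_size=MIN_CELL_SIZE):
    """Return {cell: missing completes} for cells too small after cleaning.

    `responses` is a list of (cell, flagged) pairs, where `flagged` is True
    if post-fieldwork QC marked the complete as fraudulent.
    """
    all_cells = {cell for cell, _ in responses}
    # Count only the completes that survive cleaning.
    clean = Counter(cell for cell, flagged in responses if not flagged)
    return {
        cell: min_cell_size - clean[cell]
        for cell in sorted(all_cells)
        if clean[cell] < min_cell_size
    }

# Illustrative data: both cells closed at 60 completes during fieldwork.
responses = (
    [("UK 18-24", False)] * 58 + [("UK 18-24", True)] * 2
    + [("DE 55+", False)] * 33 + [("DE 55+", True)] * 27
)
print(cell_shortfalls(responses))  # {'DE 55+': 17}
```

In this made-up example the lightly affected cell survives cleaning, while the heavily affected one ends up 17 completes short of the assumed minimum, which is the point at which the three choices above come into play.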
The Quota Contamination Problem
There is a subtler problem with post-fieldwork cleaning that is less often discussed. When quota cells fill with a mixture of genuine and fraudulent respondents, the quota is marked as complete and suppliers stop sending respondents to that cell.
When the fraudulent respondents are subsequently removed, the cell is no longer complete. But the quota has already closed. Genuine respondents who could have filled the cell were turned away while fraudulent ones were counted.
The only way to prevent this is real-time detection: catching the fraud before the response is recorded, before the quota is incremented, before the genuine respondent is turned away.
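The difference can be sketched with a small simulation of a single quota cell. This is an illustrative toy, not a description of any particular product: the arrival pattern is invented, and `perfect_detector` is a hypothetical stand-in for a real-time check that is assumed, unrealistically, to catch every fraudulent respondent.

```python
def fill_quota(arrivals, target, is_fraud_detected=None):
    """Simulate one quota cell that closes once `target` completes are counted.

    `arrivals` is an ordered list of booleans (True = fraudulent respondent).
    `is_fraud_detected` is a hypothetical real-time check; when it is None,
    every respondent is accepted and fraud is only removed after fieldwork.
    Returns (genuine completes kept, genuine respondents turned away).
    """
    counted = []      # fraud flags of completes that incremented the quota
    turned_away = 0   # genuine respondents who arrived after the quota closed
    for is_fraud in arrivals:
        if len(counted) >= target:
            # Quota already closed: suppliers stop sending to this cell.
            if not is_fraud:
                turned_away += 1
            continue
        if is_fraud_detected is not None and is_fraud_detected(is_fraud):
            continue  # rejected before the quota is incremented
        counted.append(is_fraud)
    genuine_kept = sum(1 for flag in counted if not flag)
    return genuine_kept, turned_away

def perfect_detector(is_fraud):
    # Assumption: an idealized detector that is always right.
    return is_fraud

# Invented arrival stream: fraudulent and genuine respondents alternate.
arrivals = [True, False] * 10

print(fill_quota(arrivals, target=10))                    # (5, 5)
print(fill_quota(arrivals, target=10, is_fraud_detected=perfect_detector))  # (10, 0)
```

Under these assumptions, post-fieldwork cleaning leaves the cell with 5 genuine completes after 5 genuine respondents were turned away, while screening at entry fills the cell with 10 genuine completes from the same stream. A real detector would have false positives and false negatives, so the gap in practice would be smaller than this toy suggests.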
Post-fieldwork QC is a recovery mechanism. Real-time detection is prevention. The difference between them is the difference between treating illness and preventing it.
SoftSight — SurveyGuard detects fraud before it enters your quota count. softsight.io