EU AI Act Article 10: Data Governance Requirements for High-Risk AI Systems
Key takeaways
- Article 10 requires documented data governance practices covering collection, preparation, labelling, and bias examination for all training and testing datasets.
- High-risk AI providers must use datasets that are relevant, representative, and as free of errors as possible, and must document known gaps and limitations.
- Data governance obligations are ongoing: they apply to initial training and every subsequent update to the AI system's data pipeline.
Article 10 of the EU AI Act is where the regulation gets deeply technical. It sets requirements for the data used to train, validate, and test high-risk AI systems — covering everything from data collection practices to bias examination. For AI providers, this article defines what "good enough" data looks like under EU law.
If your AI system is classified as high-risk, whether under Annex III or as a product or safety component covered by the legislation listed in Annex I, Article 10 applies to every dataset that feeds your model: training data, validation data, and testing data. The obligations are ongoing, so they cover your initial development and every subsequent update.
What Article 10 requires
Article 10 requires high-risk AI providers to implement data governance and management practices. Specifically, you must address:
- Design choices. Document the decisions behind your data strategy — why you chose specific data sources, what alternatives you considered, and how your dataset design relates to the system's intended purpose.
- Data collection processes. Document how data was collected, from whom, when, and under what conditions. This includes whether data was purchased, scraped, donated, generated, or synthesised.
- Data preparation. Document all processing operations — annotation, labelling, cleaning, enrichment, aggregation. Include who performed these operations and what quality controls were applied.
- Relevance and representativeness. Your datasets must be relevant to the system's intended purpose and sufficiently representative of the population or scenarios the system will encounter.
- Data gaps and known limitations. You must identify, document, and address any gaps in your data that could affect the system's performance or introduce bias.
Data quality criteria
Article 10(3) sets out specific quality criteria that training, validation, and testing datasets must meet:
- Relevant. The data must relate to the system's intended purpose and operational context. A credit scoring model trained only on data from one country may not be relevant for deployment across the EU.
- Sufficiently representative. The dataset must reflect the diversity of the population and scenarios the system will encounter. Underrepresentation of specific groups is a compliance risk.
- Free of errors, as far as possible. The Act recognises perfect data doesn't exist. You must implement reasonable quality controls and document remaining data quality issues.
- Complete. The dataset must be complete in view of the intended purpose and have the statistical properties appropriate for it. Missing features or insufficient coverage of edge cases should be identified and documented.
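The criteria above can be turned into measurable checks. The following is a minimal sketch, not a method prescribed by the Act: the function name `quality_report`, the field names, and the reference population shares are all hypothetical, and the choice of metrics (missing rate, duplicate rate, gap against a reference distribution) is an assumption about what "reasonable quality controls" might look like for a tabular dataset.

```python
from collections import Counter

def quality_report(rows, required_fields, group_field, reference_shares):
    """Summarise missing values, exact duplicates, and group representation.

    All names and metrics here are illustrative assumptions.
    """
    total = len(rows)
    if total == 0:
        raise ValueError("dataset is empty")
    # Rows where any required field is absent or None.
    missing = sum(
        1 for row in rows
        if any(row.get(field) is None for field in required_fields)
    )
    # Exact duplicates: rows whose full contents repeat an earlier row.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    # Each group's share of the dataset minus its expected (reference) share.
    group_counts = Counter(row.get(group_field) for row in rows)
    representation_gap = {
        group: group_counts.get(group, 0) / total - expected
        for group, expected in reference_shares.items()
    }
    return {
        "rows": total,
        "missing_rate": missing / total,
        "duplicate_rate": duplicates / total,
        "representation_gap": representation_gap,
    }

# Hypothetical toy dataset and reference shares.
rows = [
    {"age_band": "18-34", "income": 30000},
    {"age_band": "18-34", "income": 30000},
    {"age_band": "35-64", "income": None},
    {"age_band": "35-64", "income": 52000},
]
report = quality_report(
    rows,
    required_fields=["age_band", "income"],
    group_field="age_band",
    reference_shares={"18-34": 0.5, "35-64": 0.3, "65+": 0.2},
)
print(report)
```

A negative representation gap (here for the "65+" band, which has no rows at all) is the kind of finding that would be recorded as a known limitation rather than silently ignored.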
Bias examination and mitigation
Article 10(2)(f) explicitly requires examination of datasets for possible biases that could lead to discrimination. This is not optional — it is a legal requirement for every high-risk AI system.
In practice, bias examination means:
- Statistical analysis of protected characteristics. Examine your data distribution across categories like gender, age, ethnicity, and disability where relevant to your system's purpose. Document any imbalances.
- Proxy variable identification. Identify features that may serve as proxies for protected characteristics — postal codes correlating with ethnicity, name patterns correlating with gender, etc.
- Output fairness testing. Test your model's outputs across different demographic groups. Measure disparate impact using appropriate fairness metrics for your use case.
- Mitigation documentation. When you identify bias, document the mitigation steps you took — resampling, reweighting, adversarial debiasing, or post-processing adjustments — and the results.
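As one illustration of output fairness testing, the sketch below computes per-group selection rates and a disparate impact ratio. The Act does not prescribe a metric or a threshold. The 0.8 cut-off used here is borrowed from the "four-fifths" rule of thumb used in US employment contexts and is purely an assumption, as are the function names and the record layout.

```python
from collections import defaultdict

def selection_rates(records, group_key, outcome_key):
    """Share of favourable outcomes per demographic group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[outcome_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}

def disparate_impact_ratio(rates):
    """Lowest group selection rate divided by the highest."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group received a favourable outcome, so rates do not differ
    return min(rates.values()) / highest

def fairness_gate(records, group_key, outcome_key, min_ratio=0.8):
    """Return rates, ratio, and a pass/fail flag (threshold is an assumption)."""
    rates = selection_rates(records, group_key, outcome_key)
    ratio = disparate_impact_ratio(rates)
    return {"rates": rates, "ratio": ratio, "passed": ratio >= min_ratio}

# Hypothetical model outputs: 6 of 10 approvals in group A, 3 of 10 in group B.
records = (
    [{"group": "A", "approved": i < 6} for i in range(10)]
    + [{"group": "B", "approved": i < 3} for i in range(10)]
)
result = fairness_gate(records, "group", "approved")
print(result)
```

Disparate impact is only one of several fairness metrics, and the appropriate one depends on the use case. Whatever metric is chosen, recording the rates, the ratio, and the decision alongside the model version gives you the evidence trail that the mitigation documentation step calls for.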
Documentation requirements
Annex IV requires detailed documentation of your data governance practices. You must be able to provide:
- Dataset descriptions. For each dataset: source, size, format, collection period, geographical coverage, demographic coverage, and known limitations.
- Data processing documentation. Every transformation applied to the data — cleaning rules, feature engineering, augmentation, anonymisation — with version history.
- Bias analysis reports. Results of your bias examination, including statistical analyses, identified issues, mitigation steps taken, and residual risks accepted.
- Data quality metrics. How you measured data quality, what thresholds you set, and evidence of compliance with those thresholds.
- Special categories of personal data. If your system processes special categories under GDPR (Article 9 data), document the legal basis and specific safeguards in place. Article 10(5) allows limited processing of special category data where strictly necessary for bias detection and correction, subject to strict conditions.
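A dataset description of the kind listed above can be kept as a structured, machine-readable record rather than free text. The following is a minimal sketch under assumptions: the class name `DatasetRecord`, its field names, and the example values are hypothetical and are not a schema defined by Annex IV.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetRecord:
    """Hypothetical per-dataset description; fields mirror the list above."""
    name: str
    source: str
    size_rows: int
    data_format: str
    collection_period: str
    geographical_coverage: list
    demographic_coverage: str
    known_limitations: list = field(default_factory=list)
    processing_steps: list = field(default_factory=list)

    def missing_fields(self):
        """Names of fields left empty, to flag incomplete documentation."""
        return [
            name for name, value in asdict(self).items()
            if value in ("", [], None)
        ]

# Illustrative values only.
record = DatasetRecord(
    name="loan_applications_v3",
    source="internal CRM export",
    size_rows=120000,
    data_format="parquet",
    collection_period="2021-01 to 2023-12",
    geographical_coverage=["DE", "FR"],
    demographic_coverage="adults 18-75",
    known_limitations=["No applicants from outside DE and FR"],
)
print(json.dumps(asdict(record), indent=2))
print("Incomplete fields:", record.missing_fields())
```

Storing these records in version control next to the data processing code keeps the version history that the documentation requirements ask for.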
Practical implementation
To implement Article 10 data governance in practice:
- Create a data governance policy. A single document covering your data management practices for AI systems. This should reference your GDPR processes where they overlap but add AI-specific requirements.
- Implement data lineage tracking. You need to trace every piece of training data back to its source. Tools like DVC, MLflow, or custom metadata pipelines can help. The key is being able to answer: where did this data come from, and what happened to it?
- Build bias testing into your CI/CD pipeline. Don't make bias examination a one-time exercise. Automated fairness checks on every model update catch drift before it becomes a compliance issue.
- Use datasheets for datasets. The academic concept of "datasheets for datasets" maps closely to what Article 10 requires. For each dataset, maintain a structured document covering purpose, composition, collection, processing, and known issues.
- Start with your risk classification. If your system isn't high-risk, Article 10 doesn't apply directly. But the practices it describes are good data governance regardless.
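The lineage question above (where did this data come from, and what happened to it?) can be illustrated with a small content-hashing log. This is a sketch using only the Python standard library. It is not how DVC or MLflow work internally, and the function names, the entry layout, and the file name `hypothetical_export.csv` are assumptions made for the example.

```python
import hashlib
import json

def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest identifying an exact version of the data."""
    return hashlib.sha256(data).hexdigest()

def lineage_entry(step, input_bytes, output_bytes, params, parent=None):
    """Record one processing step, linked to the step that preceded it."""
    entry = {
        "step": step,
        "input_sha256": fingerprint(input_bytes),
        "output_sha256": fingerprint(output_bytes),
        "params": params,
        "parent": parent,
    }
    # Hash the entry itself so that later edits to the log are detectable.
    entry["entry_id"] = fingerprint(
        json.dumps(entry, sort_keys=True).encode("utf-8")
    )
    return entry

def trace(entry_id, log):
    """Walk back from an entry to the original source, newest step first."""
    by_id = {entry["entry_id"]: entry for entry in log}
    chain = []
    while entry_id is not None:
        entry = by_id[entry_id]
        chain.append(entry["step"])
        entry_id = entry["parent"]
    return chain

# Hypothetical raw data containing one duplicate row.
raw = b"id,income\n1,30000\n1,30000\n2,52000\n"
lines = raw.decode("utf-8").splitlines()
cleaned = ("\n".join(dict.fromkeys(lines)) + "\n").encode("utf-8")

ingest = lineage_entry("ingest", raw, raw, {"source": "hypothetical_export.csv"})
dedupe = lineage_entry(
    "drop_duplicates", raw, cleaned, {"subset": "all columns"},
    parent=ingest["entry_id"],
)
log = [ingest, dedupe]
print(trace(dedupe["entry_id"], log))
```

Because each entry stores the hash of its input and output, a mismatch between one step's output hash and the next step's input hash shows that data changed outside the documented pipeline.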
Article 10 data governance is one of the more technically demanding EU AI Act obligations. At the time of writing, the high-risk deadline is 559 days away. Start documenting your data practices now: retroactively reconstructing data lineage is significantly harder than tracking it from the start.