Traceability and Good Data Practices for AI Training in Local Government

Why training data traceability matters for public bodies

AI projects in city councils and other public entities rely on datasets to train, validate and test models. Without clear controls over which data are used, how they were obtained and who modifies them, risks multiply: GDPR violations, non-compliance with the Spanish National Security Framework (ENS, Royal Decree 311/2022), undetected biases, results that cannot be reproduced, and difficulty meeting the EU AI Act’s transparency requirements.

Traceability is not just a regulatory box to tick: it’s an operational practice that facilitates audits, reproducibility, debugging and responsible AI governance. Below is a practical approach aimed at technical teams and legal/operational managers.

Key principles to apply

Lawfulness and minimization (GDPR): identify the legal basis (typically the performance of a task carried out in the public interest or in the exercise of official authority) and limit data to the minimum necessary.
Security and protection (ENS, RD 311/2022): apply technical and organizational controls appropriate to the category assigned to the information system (basic, medium or high).
Transparency and documentation (EU AI Act): keep records that explain what data fed a model and for what purpose.
Reproducibility and quality control: keep dataset versions, metadata and tests that make it possible to retrain a model or roll back a change.

Practical steps to design traceable training data flows

1. Inventory and classify data

Keep a centralized inventory (spreadsheet or data catalog) that records, for each dataset: source, owner, data controller, purpose and sensitivity level (personal data, special categories, aggregated).
Identify personal data: if present, document the legal basis and retention limits.
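As a minimal sketch of what a catalog row with mandatory fields could look like, the snippet below models one inventory entry and flags entries with personal data but no legal basis or retention limit. The class name, field names and validation messages are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Sensitivity(Enum):
    PERSONAL = "personal"
    SPECIAL_CATEGORIES = "special_categories"
    AGGREGATED = "aggregated"


@dataclass(frozen=True)
class CatalogEntry:
    """One row of the dataset inventory (hypothetical field names)."""
    name: str
    source: str
    owner: str
    controller: str
    purpose: str
    sensitivity: Sensitivity
    legal_basis: Optional[str] = None      # e.g. "GDPR Art. 6(1)(e)"
    retention_until: Optional[str] = None  # ISO date, e.g. "2027-12-31"


def validate(entry: CatalogEntry) -> list[str]:
    """Return a list of problems; an empty list means the entry is complete."""
    problems = []
    for field_name in ("name", "source", "owner", "controller", "purpose"):
        if not getattr(entry, field_name).strip():
            problems.append(f"missing {field_name}")
    has_personal_data = entry.sensitivity in (
        Sensitivity.PERSONAL,
        Sensitivity.SPECIAL_CATEGORIES,
    )
    if has_personal_data:
        if not entry.legal_basis:
            problems.append("personal data without a documented legal basis")
        if not entry.retention_until:
            problems.append("personal data without a retention limit")
    return problems
```

A check like this can run whenever the catalog is updated, so incomplete entries are caught before a dataset reaches a training pipeline.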

2. Define the operational legal framework

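Establish the legal basis under GDPR Article 6, for example the performance of a task carried out in the public interest or in the exercise of official authority (Article 6(1)(e)), and document it in an internal register. Special categories of data additionally require a condition under Article 9(2).
Carry out a Data Protection Impact Assessment (DPIA) when the processing is likely to result in a high risk to the rights and freedoms of individuals (GDPR Article 35).
Ensure contracts and data processing agreements with providers (data processors) include security and sub-processing clauses.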

3. Minimization and pseudonymization by design

Apply minimization principles: remove unnecessary variables before storage.
Pseudonymize personal data where possible; keep re-identification mappings under strict control.
Record transformations and anonymization algorithms to justify decisions in audits.
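The three points above can be sketched as a single transformation step. This example assumes a keyed hash (HMAC-SHA256) as the pseudonymization technique, which is one common option rather than the only valid one; the function names and log fields are hypothetical. The secret key plays the role of the re-identification mapping and would need to be stored separately under strict access control. Note that pseudonymized data remain personal data under the GDPR.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash (HMAC-SHA256): the same input and key give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def transform_records(records, id_field, drop_fields, key):
    """Drop unneeded fields, pseudonymize the identifier, and describe what was done.

    Returns (new_records, log_entry). The input records are not modified.
    """
    if id_field in drop_fields:
        raise ValueError("id_field cannot also be dropped")
    output = []
    for record in records:
        # Minimization: keep only the fields the model actually needs.
        kept = {k: v for k, v in record.items() if k not in drop_fields}
        kept[id_field] = pseudonymize(str(kept[id_field]), key)
        output.append(kept)
    # Audit trail: what was transformed, how, and when (never the key itself).
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pseudonymized_field": id_field,
        "algorithm": "HMAC-SHA256",
        "dropped_fields": sorted(drop_fields),
        "record_count": len(output),
    }
    return output, log_entry
```

Storing the returned log entry next to the dataset version gives auditors a record of which algorithm was applied and which variables were removed.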

4. Annotation protocols and quality

Define annotation guidelines and create an annotator manual (instructions, examples, rules for resolving ambiguities).
Maintain a control dataset (gold standard) and track inter-annotator agreement (IAA) metrics.
Version annotations and keep histories (who annotated what and when).
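One widely used inter-annotator agreement metric for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration; the source does not say which metric to use, so treating kappa as the metric is an assumption.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)
```

Logging this value per annotation batch, together with the annotator identifiers and the guideline version, turns the quality log into something that can be audited later. What counts as an acceptable value is a project decision.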

5. Separation of environments and test data

Maintain isolated environments: development, validation, pre-production and production.
Do not use real citizen data in development or test environments without appropriate protections; consider synthetic datasets for integration testing.
Log which samples were used in each test, together with the validation results.
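For integration testing, a seeded generator can produce records that follow the schema without being derived from any real person. The sketch below assumes a hypothetical schema for citizen service requests (the field names and categories are invented for illustration) and makes no attempt to reproduce real statistical distributions.

```python
import random

# Hypothetical request categories; replace with the project's own schema.
CATEGORIES = ["noise", "lighting", "waste", "roads"]


def synthetic_requests(n, seed=0):
    """Generate n fake service requests. The same seed always gives the same data."""
    rng = random.Random(seed)
    return [
        {
            "request_id": f"SYN-{i:06d}",  # prefix marks the record as synthetic
            "category": rng.choice(CATEGORIES),
            "district": rng.randint(1, 10),
            "days_to_resolve": rng.randint(0, 60),
        }
        for i in range(n)
    ]
```

Recording the seed in the test log makes it possible to regenerate the exact sample a given test run used.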

6. Traceability, metadata and versioning

For each dataset, maintain a "dataset README" with: purpose, creation date, schema, source, applied transformations, version and owners.
Put data and annotations under version control (Git LFS, DVC or a similar tool).
Keep a data lineage log that links dataset versions to trained model versions.
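Where a dedicated tool is not yet in place, a lineage log can start as an append-only JSON Lines file that ties a content hash of the dataset to the model version trained on it. The following is a minimal sketch under that assumption; the file layout, function names and entry fields are hypothetical and do not replace a tool such as DVC.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Content hash of a file, read in chunks so large datasets fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def append_lineage(log_path, dataset_path, dataset_version, model_version, created_by):
    """Append one record linking a dataset version to a trained model version."""
    entry = {
        "dataset_file": Path(dataset_path).name,
        "dataset_version": dataset_version,
        "dataset_sha256": file_sha256(dataset_path),
        "model_version": model_version,
        "created_by": created_by,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def datasets_for_model(log_path, model_version):
    """Return every lineage record for the given model version."""
    with open(log_path, encoding="utf-8") as log:
        entries = [json.loads(line) for line in log if line.strip()]
    return [e for e in entries if e["model_version"] == model_version]
```

Because the hash is computed from the file contents, a later audit can confirm that the dataset on disk is byte-for-byte the one recorded at training time.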

7. Managing vendors and the data supply chain

Include clauses in contracts that require providers to supply metadata and guarantees of quality and legality.
Periodically audit data and annotation providers, and request documentary evidence (procedures, annotator training, quality controls).

8. Continuous auditing and bias metrics

Implement periodic tests to detect bias and performance degradation due to data drift.
Keep metrics and reports available for internal audits and to meet transparency obligations where the EU AI Act applies.
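As an illustration of what such periodic tests might compute, the sketch below shows two simple indicators: the gap in positive-prediction rates between groups (a demographic parity check) and the population stability index (PSI) for drift in a categorical feature. Both metric choices are assumptions made for illustration; which metrics and thresholds are appropriate depends on the use case and is a policy decision.

```python
import math
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Share of positive predictions (value 1) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def parity_gap(predictions, groups):
    """Difference between the highest and lowest group positive rates."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


def population_stability_index(reference, current, epsilon=1e-6):
    """PSI between two samples of a categorical feature; 0.0 means identical shares."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    psi = 0.0
    for category in set(reference) | set(current):
        # epsilon avoids log(0) when a category is absent from one sample
        p = max(reference.count(category) / len(reference), epsilon)
        q = max(current.count(category) / len(current), epsilon)
        psi += (q - p) * math.log(q / p)
    return psi
```

Storing these values per model version and evaluation date produces the time series an internal audit would need to see whether a model has degraded or diverged between groups.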

Minimum artifacts to create

Dataset catalog (spreadsheet/portal) with mandatory fields.
Dataset README / metadata sheet per dataset.
Annotation manual and quality log (IAA).
Versioning system (DVC or similar) and lineage record.
DPIA template adapted to AI and a model contractual clause for processors.

Pragmatic implementation in 90 days

Weeks 1–4: inventory and classify the datasets in use.
Weeks 5–8: define legal bases and complete DPIAs for priority cases.
Weeks 9–12: establish minimum versioning and a dataset README for projects in production; define responsibilities (data steward).

Tools and deployment can vary; platforms like OptimGov already integrate some traceability controls and can speed adoption, but the critical point is organizational discipline and documentation.

Takeaway / Immediate action

Today, create a basic inventory and a metadata sheet for your first AI dataset. Assign an owner (data steward), document the legal basis and record the dataset’s first version with its README. That small step reduces legal risk, eases audits and is the foundation for building full traceability.

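Quick normative references: GDPR (Regulation (EU) 2016/679: Article 5 principles, Article 6 legal bases, Article 35 DPIA), ENS (Royal Decree 311/2022) and the EU AI Act (Regulation (EU) 2024/1689: data governance, technical documentation and record-keeping obligations for high-risk systems, Articles 10 to 12).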
