Data Documentation Template

EU AI Act Article 10
EU AI Act Article 11 ; Annex IV paragraph 1, 2 (d)

Dataset Owner: Name and contact information
Document Version: Version controlling this document is highly recommended
Reviewers: List reviewers

Overview

Dataset Description

EU AI Act Article 11 ; Annex IV paragraph 1, 2(d)

Write a short summary describing your dataset (limit 200 words). Include the content and topic of the data, the sources and motivations for the dataset, its benefits, and the problems or use cases it is suitable for. For readers who spend only 10 seconds on this data card, one good overview image can make the difference between the dataset being discovered and going unnoticed.

For more tips on describing data, see Zalando Data Foundation's Quality of Data Descriptions!

Status

Status Date: YYYY-MM-DD

Status: specify one of:

  • Under Preparation -- The dataset is still under active curation and is not yet ready for use due to active "dev" updates.
  • Regularly Updated -- New versions of the dataset have been or will continue to be made available.
  • Actively Maintained -- No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.
  • Limited Maintenance -- The data will not be updated, but any technical issues will be addressed.
  • Deprecated -- This dataset is obsolete or is no longer being maintained.

Example references:

  • GitHub Repository
  • Paper/Documentation Link
  • Initiative Demo
  • Conference Talk
  • API Link

Developers

  • Name, Team
  • Name, Team

Owner

  • Team Name, Contact Person

Instructions for Use for Deployers

EU AI Act Article 13

  • Instructions for use for deployers:

Version Details

Data Versioning

EU AI Act Article 11 ; Annex IV paragraph 2(d)

Data Version Control Tools:

  • Include a Data_versioning.md file to document changes
  • DVC (Data Version Control): Tracks datasets, connects them to model versions, and integrates with Git.
  • Git-LFS (Large File Storage): Stores large data files outside the Git repository.
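
Entries in Data_versioning.md can be generated instead of typed by hand. The sketch below is one possible approach and is not part of DVC or Git-LFS: it hashes each data file with SHA-256 and formats one changelog entry. The function names and the entry layout are assumptions made for illustration.

```python
import hashlib
from datetime import date
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changelog_entry(version: str, files: list[Path], note: str, on: date) -> str:
    """Format one entry for Data_versioning.md (this layout is an assumption)."""
    lines = [f"Version {version} ({on.isoformat()})", f"  Change: {note}"]
    for path in sorted(files):
        lines.append(f"  {path.name}  sha256={file_sha256(path)}")
    return "\n".join(lines) + "\n"
```

Appending the returned text to Data_versioning.md on every release gives reviewers a digest they can recompute to confirm which files a version refers to.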

Maintenance of Metadata and Schema Versioning

EU AI Act Article 11 ; Annex IV paragraph 3

Why

Data formats, schema, and other metadata changes can impact downstream processes. Tracking these ensures transparency.

How

Create a data dictionary:

  • Document dataset structure, column descriptions, data types, and relationships.

Track schema changes:

  • Use tools to log schema evolution.
  • Record changes as part of version control or data pipelines.

Save metadata alongside datasets:

  • Include details like source, timestamp, description, version, and quality metrics.
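
As a minimal sketch of the steps above, the example below compares two versions of a schema and stores the result in a metadata record saved alongside the dataset. The column names, the metadata values, and the `diff_schema` helper are hypothetical; a real pipeline would read the schemas from the data dictionary.

```python
import json


def diff_schema(old: dict, new: dict) -> dict:
    """Compare two {column: type} schemas and list what changed."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(col for col in shared if old[col] != new[col]),
    }


# Hypothetical schemas for two dataset versions: {column name: data type}.
schema_v1 = {"user_id": "string", "age": "int", "country": "string"}
schema_v2 = {"user_id": "string", "age": "float", "region": "string"}
changes = diff_schema(schema_v1, schema_v2)

# Hypothetical metadata record, e.g. written to a JSON file next to the data.
metadata = {
    "source": "internal survey export",
    "timestamp": "2024-01-31T00:00:00Z",
    "description": "Example dataset, version 2",
    "version": "2.0.0",
    "quality_metrics": {"missing_rate": 0.02},
    "schema": schema_v2,
    "schema_changes": changes,
}
sidecar_text = json.dumps(metadata, indent=2, sort_keys=True)
```

Keeping the schema diff inside the metadata record means a downstream consumer can see breaking changes (removed or retyped columns) without opening the data itself.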

Known Usages

EU AI Act Article 11 ; Annex IV paragraph 3

Model(s)

Model           | Model Task           | Purpose of Dataset Usage
----------------|----------------------|-------------------------
Example Model 1 | Image Segmentation   | Fairness evaluation
Example Model 2 | Skin Tone Classifier | Training and validation

Note, this table does not have to be exhaustive. Dataset users and documentation consumers at large are highly encouraged to contribute known usages.

Application(s)

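Application           | Brief Description            | Purpose of Dataset Usage
----------------------|------------------------------|-------------------------------------------------------
Example Application 1 | Size and Fit Recommendations | Fairness evaluation of end-to-end application pipeline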

Dataset Characteristics

EU AI Act Article 11 ; Annex IV paragraph 2(d)

Data Types: (e.g., images, text, audio, structured, unstructured data, personal data)
Size/Volume:
Number of Instances/Records:
Primary Use Case(s): Description of the main AI use cases that the dataset was designed for or is typically used in.
Associated AI System(s): List known AI system(s) that this dataset is or has been used in.
Number of Features/Attributes (if applicable):
Label Information (if applicable):
Geographical Scope: Geographic location(s) where the data was collected.
Date of Collection: Start and end date of data collection.

Data Origin and Source

Source(s): Provide information about where the data was sourced from (e.g., public datasets, sensors, surveys, web scraping, crowdsourced).
Third-Party Data: Indicate if any part of the dataset was obtained from third parties, and if so, detail the legal agreements in place (license, usage rights, etc.).
Ethical Sourcing: Provide information on the ethical and legal compliance of the data collection process (e.g., informed consent, transparency to data subjects, and compliance with GDPR or other regulations).

Provenance

Describe the history and origin of the data.

Collection

Method(s) Used

Specify one or more of:

  • API
  • Artificially generated
  • Crowdsourced - Internal Employee
  • Crowdsourced - External Paid
  • Crowdsourced - Volunteer
  • Vendor collection efforts
  • Scraped or crawled
  • Survey, forms, or polls
  • Interviews, focus groups
  • Scientific experiment
  • Taken from other existing datasets
  • Unknown
  • To be determined
  • Others (please specify)

Methodology Detail(s)

EU AI Act Article 11 ; Annex IV 2 (a), (b), (d)

Collection Type

Source: Describe here. Include links where available.

Platform: [Platform Name], Describe platform here. Include links where relevant.

Is this source considered sensitive or high-risk? [Yes/No]

Dates of Collection: [YYYY-MM -- YYYY-MM]

Update Frequency for collected data:

Select one for this collection type: yearly, quarterly, monthly, on demand, no changes, or other (please specify).

Additional Links for this collection:

See section on Access, Retention, and Deletion

Additional Notes: Add here

Source Description(s)

  • Source: Describe here. Include links, data examples, metrics, visualizations where relevant.
  • Source: Describe here. Include links, data examples, metrics, visualizations where relevant.
  • Source: Describe here. Include links, data examples, metrics, visualizations where relevant.

Additional Notes: Add here

Collection Cadence

Static: Data was collected once from single or multiple sources.

Streamed: Data is continuously acquired from single or multiple sources.

Dynamic: Data is updated regularly from single or multiple sources.

Others: Please specify

Data Pre-Processing

EU AI Act Article 11 ; Annex IV paragraph 2 (d, e)

Data Cleaning

  • Handling missing data: (e.g., removal, imputation method used)
  • Outlier treatment: (e.g., detection and removal technique)
  • Duplicates removal: (Yes/No)
  • Error correction: (Manual/Automated, if applicable)

Data Transformation

  • Normalization/Standardization: (Method used, e.g., min-max scaling)
  • Encoding categorical data: (e.g., one-hot encoding, label encoding)
  • Text tokenization: (Applicable for NLP tasks)
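
The two transformation examples named above, min-max scaling and one-hot encoding, can be sketched in a few lines. This is a standard-library illustration with hypothetical function names, not a replacement for the library actually used in the pipeline.

```python
def min_max_scale(values: list) -> list:
    """Rescale numbers linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels: list) -> tuple:
    """Return the sorted category list and one 0/1 vector per input label."""
    categories = sorted(set(labels))
    position = {category: i for i, category in enumerate(categories)}
    vectors = []
    for label in labels:
        vector = [0] * len(categories)
        vector[position[label]] = 1
        vectors.append(vector)
    return categories, vectors
```

When documenting the transformation, record the fitted parameters as well (the minimum and maximum, or the category order), since the same values must be reused on new data.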

Feature Engineering

  • Feature selection: (e.g., methods used to select features)
  • Feature extraction: (e.g., PCA, interaction terms)
  • Newly created features: (List any)

Dimensionality Reduction

  • Technique(s) used: (e.g., PCA, t-SNE)
  • Number of dimensions after reduction: (Specify)

Data Augmentation

  • Augmentation technique(s): (e.g., rotation, flipping for images)
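
As an illustration of the flipping and rotation examples above, the sketch below treats an image as a list of pixel rows. The representation and function names are assumptions; real pipelines typically use an image library instead.

```python
def flip_horizontal(image: list) -> list:
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image: list) -> list:
    """Rotate an image (a list of pixel rows) by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


# A tiny 2x2 "image" with hypothetical pixel values.
image = [[1, 2],
         [3, 4]]
flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
```

Each augmentation listed in this section should state its parameters (for example, which rotation angles were used) so the augmented set can be reproduced.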

Data Annotation and Labeling

EU AI Act Article 11 ; Annex IV 2(d)

  • Annotation Process: Describe the process used to label or annotate the data (e.g., human labelers, automated, crowdsourcing).
  • Annotation platform
  • Validation: Explain any quality control mechanisms applied to ensure accurate labeling or annotation
  • Inter-Annotator agreement
  • Consensus process
  • Calibration rounds

  • Annotator Demographics (Location / Language / Expertise / Background)
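
Inter-annotator agreement is usually reported as a number. One common choice for two annotators is Cohen's kappa, which corrects the observed agreement for agreement expected by chance. The sketch below is a plain-Python illustration with a hypothetical function name; other measures (for example, for more than two annotators) may fit better.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. Report the value together with the number of items it was computed on.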

Validation Types

Method(s)

Example: range and constraint validation, structured validation, consistency validation
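
The validation types in the example can be sketched as follows. The rule format, the field names, and both functions are hypothetical and only illustrate what each type of check does: structure (required fields and types), range and constraint (bounds and allowed values), and consistency (agreement between fields).

```python
# Hypothetical rules: one entry per field of a record.
RULES = {
    "age": {"type": int, "min": 0, "max": 120},
    "country": {"type": str, "allowed": {"DE", "FR", "IT"}},
}


def validate_record(record: dict, rules: dict) -> list:
    """Return a list of violations for one record (empty if it is valid)."""
    problems = []
    for field, rule in rules.items():
        if field not in record:  # structure: required field
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, rule["type"]):  # structure: data type
            problems.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:  # range
            problems.append(f"{field}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:  # range
            problems.append(f"{field}: above maximum {rule['max']}")
        if "allowed" in rule and value not in rule["allowed"]:  # constraint
            problems.append(f"{field}: not an allowed value")
    return problems


def is_consistent(record: dict) -> bool:
    """Consistency: collection must not end before it starts (ISO dates sort as text)."""
    return record["collected_from"] <= record["collected_to"]
```

Counting the records that return an empty list gives the "Number of Data Points Validated" figure for the breakdown below.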

Breakdown(s)

(Validation Type)

Number of Data Points Validated: 

Description(s)

Sampling Methods

Method(s) Used

Characteristic(s)

Sampling Criteria

Description(s)

Dataset Distribution and Licensing

EU AI Act Article 11 ; Annex IV 2(d)

  • Availability:
  • Open/public or private dataset
  • Dataset Documentation Link: (Link to further details if available)
  • User Rights and Limitations:

Access, Retention, and Deletion

Access

  • [Link to filestore]
  • [Link to governance processes for data access]
  • ...

Data Security Classification and In-Scope/Out-of-Scope Delineation

Prerequisite(s)

For example:

This dataset requires membership in [specific] database groups:

  • Complete the [Mandatory Training]
  • Read [Data Usage Policy]
  • Initiate a [Data Processing Request]

Retention

Duration

Specify duration in days, months, or years.

Reasons for Duration

...

Policy Summary

Policy: Add a link to the policy if it's standardized at your company

Data Risk Assessment

Describe the assessment of data risks:

  • Foreseeable unintended outcomes or biases arising from dataset use.
  • Sources of potential discrimination or harm.

Cybersecurity Measures

EU AI Act Article 11 ; Annex IV paragraph 5

Data Security Measures

Data Storage

  • Encryption: Use AES-256; detail key management (e.g., HSM, key rotation).
  • Access Control: Implement role-based access and MFA.
  • Backup: Document backup frequency, encryption, and recovery testing.
  • Integrity Monitoring: Use hashes, checksums, or blockchain.
  • Security: Describe server protections (e.g., restricted access).

Data Transfer

  • Encryption in Transit: Specify TLS 1.3, IPsec configurations.
  • Endpoint Security: Detail device verification and certificate pinning.
  • API Security: Document authentication, rate-limiting, and channel encryption.
  • Data Masking: Use pseudonymisation for sensitive data in transit.
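
One way to pseudonymise identifiers before transfer is a keyed hash, so that the same input always maps to the same token but the token cannot be recomputed without the key. The sketch below uses HMAC-SHA256 from the Python standard library. The function name is hypothetical, and key storage and rotation (see Data Storage above) are outside its scope.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    """Map an identifier to a stable token using HMAC-SHA256 with a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key and identifier; a real key would come from a key store.
example_key = b"replace-with-a-secret-key"
token = pseudonymise("alice@example.com", example_key)
```

Because the mapping is stable, records can still be joined on the token after transfer. Note that pseudonymised data generally remains personal data under GDPR as long as the key exists.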

Data Processing

  • Secure Environments: Use containers, VMs, or trusted execution (e.g., Intel SGX).
  • Audit Logs: Specify logging standards, retention, and tamper protection.
  • Data Minimisation: Anonymise or limit collected data.
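
For the tamper protection of audit logs mentioned above, one simple technique is a hash chain: every entry stores the hash of the previous entry, so changing or removing an earlier entry invalidates all later ones. The sketch below is an in-memory illustration with hypothetical names, not a complete logging system.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(event: dict, prev: str) -> str:
    """Hash an event together with the hash of the entry before it."""
    payload = json.dumps({"event": event, "prev": prev}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event, chaining it to the last entry in the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": _entry_hash(event, prev)})


def verify(log: list) -> bool:
    """Recompute the chain; return False if any entry was altered or removed."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _entry_hash(entry["event"], prev):
            return False
        prev = entry["hash"]
    return True
```

Usage: call `append_entry(log, {"user": "u1", "action": "read"})` for each access and run `verify(log)` during periodic reviews. Storing the latest hash in a separate system makes truncation of the log detectable as well.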

Standards Applied

Data Post-Market Monitoring

  • Data Drift Detection and Monitoring: Describe what type of drift was identified (covariate drift, prior probability drift, or concept drift).
  • Audit Logs: Periodically perform manual or semi-automated reviews of data samples, and log changes in the data as well as access patterns.
  • Action plans implemented to address identified issues:
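
Covariate drift in a single numeric feature can be quantified with the Population Stability Index (PSI), which compares how a reference sample and a current sample are distributed over the same bins. The sketch below is a minimal illustration with a hypothetical function name. The bin count and the small floor used to avoid taking the logarithm of zero are assumptions. It does not cover prior probability drift or concept drift, which need label information.

```python
import math


def psi(reference: list, current: list, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def shares(values: list) -> list:
        counts = [0] * bins
        for v in values:
            index = int((v - lo) / width)
            # Values outside the reference range go into the first or last bin.
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(count / len(values), eps) for count in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A PSI of 0 means the two binned distributions are identical. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but any threshold should be chosen and documented per feature.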

EU Declaration of Conformity

EU AI Act Article 47

Standards Applied

Documentation Metadata

Version

Template Version

Documentation Authors

  • Name, Team: (Owner / Contributor / Manager)
  • Name, Team: (Owner / Contributor / Manager)
  • Name, Team: (Owner / Contributor / Manager)