Data Documentation Template¶

EU AI Act Article 10
EU AI Act Article 11 ; Annex IV paragraph 1, 2(d)

Dataset Owner: Name and contact information
Document Version: Version controlling this document is highly recommended
Reviewers: List reviewers

Overview¶

Dataset Description¶

EU AI Act Article 11 ; Annex IV paragraph 1, 2(d)

Write a short summary describing your dataset (limit 200 words). Include the content and topic of the data, its sources, the motivation for creating it, its benefits, and the problems or use cases it is suitable for. Some readers will spend only 10 seconds on this data card, so one good overview image can make the difference between the data being discovered and going unnoticed.

For more tips on describing data, see Zalando Data Foundation's Quality of Data Descriptions!

Status¶

Status Date: YYYY-MM-DD

Status: specify one of:

Under Preparation -- The dataset is still under active curation and is not yet ready for use due to active "dev" updates.
Regularly Updated -- New versions of the dataset have been or will continue to be made available.
Actively Maintained -- No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.
Limited Maintenance -- The data will not be updated, but any technical issues will be addressed.
Deprecated -- This dataset is obsolete or is no longer being maintained.

Relevant Links¶

Example references:

GitHub Repository
Paper/Documentation Link
Initiative Demo
Conference Talk
API Link

Developers¶

Name, Team
Name, Team

Owner¶

Team Name, Contact Person

Deployer Instructions for Use¶

Instructions for use for deployers:

EU AI Act Article 13

Version Details¶

Data Versioning¶

EU AI Act Article 11 ; Annex IV paragraph 2(d)

Data Version Control Tools:

Include a Data_versioning.md file to document changes
DVC (Data Version Control): Tracks datasets, connects them to model versions, and integrates with Git.
Git-LFS (Large File Storage): Stores large data files outside the Git repository.
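
As a minimal sketch of what an entry in Data_versioning.md could record, the Python below computes a SHA-256 checksum of a data file and appends one changelog line. The entry layout and the function names (`sha256_of_file`, `changelog_entry`, `append_entry`) are assumptions made for illustration; they are not part of DVC, Git-LFS, or this template.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changelog_entry(data_file: Path, version: str, summary: str, day: date) -> str:
    """Format one Markdown changelog line (this layout is an assumption)."""
    checksum = sha256_of_file(data_file)
    return f"- {day.isoformat()} | v{version} | sha256:{checksum} | {summary}\n"


def append_entry(log_file: Path, entry: str) -> None:
    """Append an entry to the changelog, creating the file if needed."""
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(entry)
```

Recording the checksum next to the version number lets a reader confirm that the file they hold is the one the changelog describes.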

Maintenance of Metadata and Schema Versioning¶

EU AI Act Article 11 ; Annex IV paragraph 3

Why¶

Data formats, schema, and other metadata changes can impact downstream processes. Tracking these ensures transparency.

How¶

Create a data dictionary:

Document dataset structure, column descriptions, data types, and relationships.

Track schema changes:

Use tools to log schema evolution.
Record changes as part of version control or data pipelines.

Save metadata alongside datasets:

Include details like source, timestamp, description, version, and quality metrics.
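
One way to log schema evolution, sketched under the assumption that a schema is represented as a plain column-to-type mapping: compare two versions and store the differences in a metadata record saved next to the dataset. The column names, type labels, version string, and the `diff_schema` helper are hypothetical.

```python
import json


def diff_schema(old: dict, new: dict) -> dict:
    """Compare two {column: dtype} mappings and list the differences."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(
            col for col in set(old) & set(new) if old[col] != new[col]
        ),
    }


# Hypothetical schemas for two dataset versions.
schema_v1 = {"user_id": "int64", "country": "string", "age": "int64"}
schema_v2 = {"user_id": "int64", "country": "category", "signup_date": "date"}

changes = diff_schema(schema_v1, schema_v2)

# Metadata record to store alongside the dataset (field names are assumptions).
record = {"version": "2.0.0", "schema": schema_v2, "schema_changes": changes}
print(json.dumps(record, indent=2, sort_keys=True))
```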

Known Usages¶

EU AI Act Article 11 ; Annex IV paragraph 3

Model(s)¶

Model	Model Task	Purpose of Dataset Usage
Example Model 1	Image Segmentation	Fairness evaluation
Example Model 2	Skin Tone Classifier	Training and validation

Note: this table does not have to be exhaustive. Dataset users and documentation consumers at large are encouraged to contribute known usages.

Application(s)¶

Application	Brief Description	Purpose of Dataset Usage
Example Application 1	Size and Fit Recommendations	Fairness Evaluation of end-to-end application pipeline

Dataset Characteristics¶

EU AI Act Article 11 ; Annex IV paragraph 2(d)

Data Types: (e.g., images, text, audio, structured, unstructured data, personal data)
Size/Volume:
Number of Instances/Records:
Primary Use Case(s): Description of the main AI use cases that the dataset was designed for or is typically used in.
Associated AI System(s): List known AI system(s) that this dataset is or has been used in.
Number of Features/Attributes (if applicable):
Label Information (if applicable):
Geographical Scope: Geographic location(s) where the data was collected.
Date of Collection: Start and end date of data collection.

Data Origin and Source¶

Source(s): Provide information about where the data was sourced from (e.g., public datasets, sensors, surveys, web scraping, crowdsourcing).
Third-Party Data: Indicate if any part of the dataset was obtained from third parties, and if so, detail the legal agreements in place (license, usage rights, etc.).
Ethical Sourcing: Provide information on the ethical and legal compliance of the data collection process (e.g., informed consent, transparency to data subjects, and compliance with GDPR or other regulations).

Provenance¶

Describe the history and origin of the data.

Collection¶

Method(s) Used¶

Specify one or more of:

API
Artificially generated
Crowdsourced - Internal Employee
Crowdsourced - External Paid
Crowdsourced - Volunteer
Vendor collection efforts
Scraped or crawled
Survey, forms, or polls
Interviews, focus groups
Scientific experiment
Taken from other existing datasets
Unknown
To be determined
Others (please specify)

Methodology Detail(s)¶

EU AI Act Article 11 ; Annex IV 2 (a), (b), (d)

Collection Type

Source: Describe here. Include links where available.

Platform: [Platform Name], Describe platform here. Include links where relevant.

Is this source considered sensitive or high-risk? [Yes/No]

Dates of Collection: [YYYY-MM -- YYYY-MM]

Update Frequency for collected data:

Select one for this collection type: yearly, quarterly, monthly, on demand, no changes, or other (please specify).

Additional Links for this collection:

See the section on Access, Retention, and Deletion

Additional Notes: Add here

Source Description(s)¶

Source: Describe here. Include links, data examples, metrics, visualizations where relevant.
Source: Describe here. Include links, data examples, metrics, visualizations where relevant.
Source: Describe here. Include links, data examples, metrics, visualizations where relevant.

Additional Notes: Add here

Collection Cadence¶

Static: Data was collected once from single or multiple sources.

Streamed: Data is continuously acquired from single or multiple sources.

Dynamic: Data is updated regularly from single or multiple sources.

Others: Please specify

Data Pre-Processing¶

EU AI Act Article 11 ; Annex IV paragraph 2 (d, e)

Data Cleaning¶

Handling missing data: (e.g., removal, imputation method used)
Outlier treatment: (e.g., detection and removal technique)
Duplicates removal: (Yes/No)
Error correction: (Manual/Automated, if applicable)
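
To make the entries above concrete, here is a small standard-library sketch of three common cleaning steps: median imputation, the interquartile-range (IQR) outlier rule, and exact-duplicate removal. These are illustrative choices only; the template does not prescribe any particular method, and the function names are hypothetical.

```python
import statistics
from typing import List, Optional, Tuple


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_bounds(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) fences used by the IQR outlier rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_duplicates(rows: List[tuple]) -> List[tuple]:
    """Remove exact duplicate rows, keeping the first occurrence in order."""
    return list(dict.fromkeys(rows))
```

Whichever methods are actually used, documenting the parameters (here the multiplier `k`) makes the cleaning step reproducible.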

Data Transformation¶

Normalization/Standardization: (Method used, e.g., min-max scaling)
Encoding categorical data: (e.g., one-hot encoding, label encoding)
Text tokenization: (Applicable to NLP tasks)
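
For reference, a minimal sketch of two of the transformations named above, min-max scaling and one-hot encoding, written in plain Python. The function names are hypothetical, and mapping a constant column to zeros is an assumption of this sketch rather than a standard.

```python
from typing import List, Tuple


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to 0.0 (an assumption of this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted categories and one indicator row per input label."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        rows.append(row)
    return categories, rows
```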

Feature Engineering¶

Feature selection: (e.g., methods used to select features)
Feature extraction: (e.g., PCA, interaction terms)
Newly created features: (List any)

Dimensionality Reduction¶

Technique(s) used: (e.g., PCA, t-SNE)
Number of dimensions after reduction: (Specify)

Data Augmentation¶

Augmentation technique(s): (e.g., rotation, flipping for images)

Data Annotation and Labeling¶

EU AI Act Article 11 ; Annex IV 2(d)

Annotation Process: Describe the process used to label or annotate the data (e.g., human labelers, automated, crowdsourcing).
Annotation platform
Validation: Explain any quality control mechanisms applied to ensure accurate labeling or annotation
Inter-Annotator agreement
Consensus process
Calibration rounds
Annotator Demographics (Location / Language / Expertise / Background)
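
Inter-annotator agreement can be reported with several statistics. As one illustration, the sketch below computes Cohen's kappa for two annotators who labeled the same items; the choice of kappa and the function name are assumptions, and other measures may suit a given dataset better.

```python
from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.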

Validation Types¶

Method(s)¶

Examples: range and constraint validation, structured validation, consistency validation
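
The three validation types can be illustrated with a small record checker. The field names (`age`, `country`, `start`, `end`) and the specific rules are hypothetical and stand in for whatever constraints the dataset actually has.

```python
from datetime import date
from typing import List


def validate_record(record: dict) -> List[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []
    # Structured validation: required fields must exist with the expected type.
    expected_types = (("age", int), ("country", str), ("start", date), ("end", date))
    for field, expected_type in expected_types:
        if not isinstance(record.get(field), expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    if errors:
        return errors
    # Range and constraint validation: values must fall within allowed limits.
    if not 0 <= record["age"] <= 120:
        errors.append("age: out of range 0-120")
    if len(record["country"]) != 2 or not record["country"].isupper():
        errors.append("country: expected a two-letter uppercase code")
    # Consistency validation: related fields must not contradict each other.
    if record["end"] < record["start"]:
        errors.append("end: earlier than start")
    return errors
```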

Breakdown(s)¶

(Validation Type)

Number of Data Points Validated:

Description(s)¶

Sampling Methods¶

Method(s) Used¶

Characteristic(s)¶

Sampling Criteria¶

Description(s)¶

Dataset Distribution and Licensing¶

EU AI Act Article 11 ; Annex IV 2(d)

Availability:
Open/public or private dataset
Dataset Documentation Link: (Link to further details if available)
User Rights and Limitations:

Access, Retention, and Deletion¶

Access¶

Relevant Links¶

[Link to filestore]
[Link to governance processes for data access]
...

Data Security Classification in and out of scope delineation¶

Prerequisite(s)¶

For example:

This dataset requires membership in [specific] database groups:

Complete the [Mandatory Training]
Read [Data Usage Policy]
Initiate a [Data Processing Request]

Retention¶

Duration¶

Specify duration in days, months, or years.

Reasons for Duration¶

...

Policy Summary¶

Policy: Add a link to the policy if it's standardized at your company

Data Risk Assessment¶

Describe the assessment of data risks:

Foreseeable unintended outcomes or biases arising from dataset use.
Sources of potential discrimination or harm.

Cybersecurity Measures¶

EU AI Act Article 11 ; Annex IV paragraph 5

Data Security Measures¶

Data Storage¶

Encryption: Use AES-256; detail key management (e.g., HSM, key rotation).
Access Control: Implement role-based access and MFA.
Backup: Document backup frequency, encryption, and recovery testing.
Integrity Monitoring: Use hashes, checksums, or blockchain.
Security: Describe server protections (e.g., restricted access).

Data Transfer¶

Encryption in Transit: Specify TLS 1.3, IPsec configurations.
Endpoint Security: Detail device verification and certificate pinning.
API Security: Document authentication, rate-limiting, and channel encryption.
Data Masking: Use pseudonymisation for sensitive data in transit.
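
Pseudonymisation can be done in several ways. One common approach, sketched here as an assumption rather than a requirement of this template, is a keyed hash (HMAC-SHA256), which maps the same identifier to the same pseudonym while keeping the mapping dependent on a secret key. The key and the record below are placeholders.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for illustration only; load a real key from a secrets manager.
key = b"example-key-do-not-use"

# Hypothetical record with one direct identifier.
record = {"email": "alice@example.com", "basket_size": 3}
masked = {**record, "email": pseudonymise(record["email"], key)}
```

Because the pseudonym is stable for a given key, records can still be joined on the masked field; rotating or destroying the key breaks that linkage.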

Data Processing¶

Secure Environments: Use containers, VMs, or trusted execution (e.g., Intel SGX).
Audit Logs: Specify logging standards, retention, and tamper protection.
Data Minimisation: Anonymise or limit collected data.
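
Tamper protection for audit logs can take many forms. The sketch below shows one simple idea, a hash chain in which each entry's hash covers the previous entry's hash, so that editing an earlier entry invalidates the chain. The entry layout and function names are assumptions for illustration, not a logging standard.

```python
import hashlib
import json
from typing import List

GENESIS_HASH = "0" * 64  # placeholder hash that precedes the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with the serialized event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: List[dict], event: dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append(
        {"event": event, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, event)}
    )


def verify_chain(log: List[dict]) -> bool:
    """Recompute every hash and confirm that the links are intact."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this only detects tampering; preventing it still depends on where the log and the latest hash are stored.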

Standards Applied¶

Data post-market monitoring¶

Data Drift Detection and Monitoring: Describe the type of drift that was identified (covariate drift, prior probability drift, or concept drift).

Audit Logs: Periodically perform manual or semi-automated reviews of data samples, and log changes to the data as well as access patterns.
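
As one possible way to quantify covariate drift in a single numeric feature, the sketch below computes the Population Stability Index (PSI) between a reference sample and a current sample. The choice of PSI, the equal-width binning, and the small floor used to avoid taking the logarithm of zero are all assumptions of this example.

```python
import math
from typing import List


def psi(reference: List[float], current: List[float], bins: int = 10) -> float:
    """Population Stability Index of `current` relative to `reference`."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant reference

    def proportions(sample: List[float]) -> List[float]:
        counts = [0] * bins
        for value in sample:
            index = int((value - lo) / width)
            # Values outside the reference range go to the first or last bin.
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(count / len(sample), 1e-6) for count in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

PSI is zero for identical distributions and grows as they diverge. Alerting thresholds vary by team and should be recorded with the monitoring setup.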

EU Declaration of conformity¶

Standards applied¶

Documentation Metadata¶

Version¶

Template Version¶

Documentation Authors¶