Data Documentation Template¶
Dataset Owner: Name and contact information
Document Version: Version controlling this document is highly recommended
Reviewers: List reviewers
Overview¶
Dataset Description¶
Write a short summary of your dataset (200 words maximum). Cover the content and topic of the data, its sources and the motivation for creating it, its benefits, and the problems or use cases it is suitable for. Many readers spend only about 10 seconds on a data card, so one good overview image can make the difference between this data being discovered and going unnoticed.
For more tips on describing data, see Zalando Data Foundation's Quality of Data Descriptions!
Status¶
Status Date: YYYY-MM-DD
Status: specify one of:
- Under Preparation -- The dataset is still under active curation and is not yet ready for use due to active "dev" updates.
- Regularly Updated -- New versions of the dataset have been or will continue to be made available.
- Actively Maintained -- No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.
- Limited Maintenance -- The data will not be updated, but any technical issues will be addressed.
- Deprecated -- This dataset is obsolete or is no longer being maintained.
Relevant Links¶
Example references:
- GitHub Repository
- Paper/Documentation Link
- Initiative Demo
- Conference Talk
- API Link
Developers¶
- Name, Team
- Name, Team
Owner¶
- Team Name, Contact Person
Instructions for Use for Deployers¶
- Instructions for use for deployers:
Version Details¶
Data Versioning¶
(Article 11, paragraph 2(d))
Data Version Control Tools:
- Include a Data_versioning.md file to document changes
- DVC (Data Version Control): Tracks datasets, connects them to model versions, and integrates with Git.
- Git-LFS (Large File Storage): Stores large data files outside the Git repository.
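As a minimal sketch of what one entry in a Data_versioning.md file could look like, the Python snippet below renders a Markdown changelog entry. The entry layout, the function name `changelog_entry`, and the example values are assumptions for illustration, not a prescribed format.

```python
from datetime import date


def changelog_entry(version: str, changed_on: date, summary: str, record_count: int) -> str:
    """Render one Markdown entry for Data_versioning.md (layout is hypothetical)."""
    return (
        f"## {version} ({changed_on.isoformat()})\n"
        f"- Summary: {summary}\n"
        f"- Records: {record_count}\n"
    )


# Example values are placeholders, not real dataset statistics.
entry = changelog_entry("v1.1.0", date(2024, 1, 15), "Removed duplicate records", 9876)
print(entry)
```

Keeping such entries in the same repository as the DVC or Git-LFS pointers lets reviewers see what changed next to the data version it applies to.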
Maintenance of Metadata and Schema Versioning¶
Why¶
Data formats, schema, and other metadata changes can impact downstream processes. Tracking these ensures transparency.
How¶
Create a data dictionary:
- Document dataset structure, column descriptions, data types, and relationships.
Track schema changes:
- Use tools to log schema evolution.
- Record changes as part of version control or data pipelines.
Save metadata alongside datasets:
- Include details like source, timestamp, description, version, and quality metrics.
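The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the dictionary layout, and the helper `diff_schemas` are hypothetical, and a real pipeline may rely on a dedicated schema registry instead.

```python
import json

# Hypothetical data dictionary: column name -> data type and description.
schema_v1 = {
    "customer_id": {"type": "string", "description": "Pseudonymised customer key"},
    "height_cm": {"type": "float", "description": "Self-reported height in cm"},
}
schema_v2 = {
    "customer_id": {"type": "string", "description": "Pseudonymised customer key"},
    "height_cm": {"type": "integer", "description": "Self-reported height in cm"},
    "country": {"type": "string", "description": "ISO 3166-1 alpha-2 country code"},
}


def diff_schemas(old: dict, new: dict) -> dict:
    """List columns that were added, removed, or changed type between two versions."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in shared if old[c]["type"] != new[c]["type"]),
    }


# Metadata saved alongside the dataset (all values are placeholders).
metadata = {
    "source": "example-survey",
    "timestamp": "2024-01-15T00:00:00Z",
    "description": "Example dataset used to illustrate schema tracking",
    "version": "2",
    "schema": schema_v2,
    "schema_changes": diff_schemas(schema_v1, schema_v2),
}
print(json.dumps(metadata, indent=2))
```

Writing the resulting JSON next to the data file keeps the data dictionary, the schema change log, and the descriptive metadata under the same version control as the dataset.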
Known Usages¶
Model(s)¶
| Model | Model Task | Purpose of Dataset Usage |
|---|---|---|
| Example Model 1 | Image Segmentation | Fairness evaluation |
| Example Model 2 | Skin Tone Classifier | Training and validation |
Note, this table does not have to be exhaustive. Dataset users and documentation consumers at large are highly encouraged to contribute known usages.
Application(s)¶
| Application | Brief Description | Purpose of Dataset Usage |
|---|---|---|
| Example Application 1 | Size and Fit Recommendations | Fairness Evaluation of end-to-end application pipeline |
Dataset Characteristics¶
Data Types: (e.g., images, text, audio, structured, unstructured data, personal data)
Size/Volume:
Number of Instances/Records:
Primary Use Case(s): Description of the main AI use cases that the dataset was designed for or is typically used in.
Associated AI System(s): List known AI system(s) that this dataset is or has been used in.
Number of Features/Attributes (if applicable):
Label Information (if applicable):
Geographical Scope: Geographic location(s) where the data was collected.
Date of Collection: Start and end date of data collection.
Data Origin and Source¶
Source(s): Provide information about where the data was sourced from (e.g., public datasets, sensors, surveys, web scraping, crowdsourced).
Third-Party Data: Indicate if any part of the dataset was obtained from third parties, and if so, detail the legal agreements in place (license, usage rights, etc.).
Ethical Sourcing: Provide information on the ethical and legal compliance of the data collection process (e.g., informed consent, transparency to data subjects, and compliance with GDPR or other regulations).
Provenance¶
Describe the history and origin of the data.
Collection¶
Method(s) Used¶
Specify one or more of:
- API
- Artificially generated
- Crowdsourced - Internal Employee
- Crowdsourced - External Paid
- Crowdsourced - Volunteer
- Vendor collection efforts
- Scraped or crawled
- Survey, forms, or polls
- Interviews, focus groups
- Scientific experiment
- Taken from other existing datasets
- Unknown
- To be determined
- Others (please specify)
Methodology Detail(s)¶
Collection Type
Source: Describe here. Include links where available.
Platform: [Platform Name], Describe platform here. Include links where relevant.
Is this source considered sensitive or high-risk? [Yes/No]
Dates of Collection: [YYYY-MM -- YYYY-MM]
Update Frequency for collected data:
Select one for this collection type: yearly, quarterly, monthly, on demand, no changes, others, ...
Additional Links for this collection:
See section on Access, Retention, and Deletion
Additional Notes: Add here
Source Description(s)¶
- Source: Describe here. Include links, data examples, metrics, visualizations where relevant.
- Source: Describe here. Include links, data examples, metrics, visualizations where relevant.
- Source: Describe here. Include links, data examples, metrics, visualizations where relevant.
Additional Notes: Add here
Collection Cadence¶
Static: Data was collected once from single or multiple sources.
Streamed: Data is continuously acquired from single or multiple sources.
Dynamic: Data is updated regularly from single or multiple sources.
Others: Please specify
Data Pre-Processing¶
Data Cleaning¶
- Handling missing data: (e.g., removal, imputation method used)
- Outlier treatment: (e.g., detection and removal technique)
- Duplicates removal: (Yes/No)
- Error correction: (Manual/Automated, if applicable)
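To make the entries above concrete, here is a minimal standard-library sketch of three common cleaning steps: median imputation, exact-duplicate removal, and outlier detection with the interquartile range (IQR) rule. The function names and the multiplier `k=1.5` are illustrative assumptions; document whichever methods your dataset actually uses.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


print(impute_median([1, None, 3]))
print(drop_duplicates([(1, "a"), (1, "a"), (2, "b")]))
print(iqr_outliers([10, 11, 12, 13, 14, 15, 100]))
```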
Data Transformation¶
- Normalization/Standardization: (Method used, e.g., min-max scaling)
- Encoding categorical data: (e.g., one-hot encoding, label encoding)
- Text tokenization: (Applicable for NLP tasks)
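As an illustration of two of the transformations named above, the sketch below implements min-max scaling and one-hot encoding with plain Python. The function names are hypothetical, and mapping a constant column to 0.0 is one possible convention, not a requirement.

```python
def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0 (an assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Return the sorted category list and one 0/1 indicator vector per label."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    vectors = [
        [1 if index[label] == i else 0 for i in range(len(categories))]
        for label in labels
    ]
    return categories, vectors


print(min_max_scale([10, 20, 30]))
print(one_hot_encode(["red", "blue", "red"]))
```

Whichever method is used, record the fitted parameters (for example the minimum and maximum, or the category list) so the same transformation can be reproduced on new data.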
Feature Engineering¶
- Feature selection: (e.g., methods used to select features)
- Feature extraction: (e.g., PCA, interaction terms)
- Newly created features: (List any)
Dimensionality Reduction¶
- Technique(s) used: (e.g., PCA, t-SNE)
- Number of dimensions after reduction: (Specify)
Data Augmentation¶
- Augmentation technique(s): (e.g., rotation, flipping for images)
Data Annotation and Labeling¶
- Annotation Process: Describe the process used to label or annotate the data (e.g., human labelers, automated, crowdsourcing).
- Annotation platform
- Validation: Explain any quality control mechanisms applied to ensure accurate labeling or annotation
- Inter-Annotator agreement
- Consensus process
- Calibration rounds
- Annotator Demographics (Location / Language / Expertise / Background)
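Inter-annotator agreement is often reported as Cohen's kappa for two annotators. The following is a minimal sketch of that statistic, assuming both annotators labeled the same items in the same order; the function name and the example labels are illustrative only, and other measures (for example Fleiss' kappa for more than two annotators) may fit your setup better.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["cat", "cat", "dog", "dog"]
annotator_2 = ["cat", "dog", "dog", "dog"]
print(cohens_kappa(annotator_1, annotator_2))
```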
Validation Types¶
Method(s)¶
Examples: range and constraint validation, structured validation, consistency validation
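A range and constraint validation can be as simple as checking each record against a set of declared rules. The sketch below assumes a hypothetical rule format (`type`, `min`, `max`, `allowed`) and hypothetical field names; it is meant to show the idea, not a specific validation library.

```python
# Hypothetical validation rules: field name -> constraints.
RULES = {
    "age": {"type": int, "min": 0, "max": 120},
    "country": {"type": str, "allowed": {"DE", "FR", "IT"}},
}


def validate_record(record, rules):
    """Return a list of human-readable violations for one record (empty if valid)."""
    errors = []
    for field, rule in rules.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: above maximum {rule['max']}")
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{field}: not an allowed value")
    return errors


print(validate_record({"age": 34, "country": "DE"}, RULES))
print(validate_record({"age": 150, "country": "XX"}, RULES))
```

Counting the records that pass and fail each rule gives the "Number of Data Points Validated" figure requested in the breakdown.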
Breakdown(s)¶
(Validation Type)
Number of Data Points Validated:
Description(s)¶
Sampling Methods¶
Method(s) Used¶
Characteristic(s)¶
Sampling Criteria¶
Description(s)¶
Dataset Distribution and Licensing¶
- Availability:
- Open/public or private dataset
- Dataset Documentation Link: (Link to further details if available)
- User Rights and Limitations:
Access, Retention, and Deletion¶
Access¶
Relevant Links¶
- [Link to filestore]
- [Link to governance processes for data access]
- ...
Data Security Classification and Scope Delineation (In and Out of Scope)¶
Prerequisite(s)¶
For example:
This dataset requires membership in [specific] database groups:
- Complete the [Mandatory Training]
- Read [Data Usage Policy]
- Initiate a [Data Processing Request]
Retention¶
Duration¶
Specify duration in days, months, or years.
Reasons for Duration¶
...
Policy Summary¶
Policy: Add a link to the policy if it's standardized at your company
Data Risk Assessment¶
Describe the assessment of data risks:
- Foreseeable unintended outcomes or biases arising from dataset use.
- Sources of potential discrimination or harm.
Cybersecurity Measures¶
Data Security Measures¶
Data Storage¶
- Encryption: Use AES-256; detail key management (e.g., HSM, key rotation).
- Access Control: Implement role-based access and MFA.
- Backup: Document backup frequency, encryption, and recovery testing.
- Integrity Monitoring: Use hashes, checksums, or blockchain.
- Security: Describe server protections (e.g., restricted access).
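For the integrity monitoring item above, a checksum-based check can be sketched as follows. The example records a SHA-256 digest for a file and later verifies the file against it; the helper names and the temporary sample file are assumptions for illustration.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True if the file's current digest matches the recorded one."""
    return hmac.compare_digest(file_sha256(path), expected_hex)


# Demonstration on a throwaway file (contents are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"id,label\n1,a\n")
    recorded = file_sha256(sample)          # store this with the dataset metadata
    intact = verify_file(sample, recorded)  # unchanged file passes
    sample.write_bytes(b"id,label\n1,b\n")  # simulate an unexpected modification
    tamper_detected = not verify_file(sample, recorded)

print(intact, tamper_detected)
```

A plain checksum detects accidental or unauthorised changes only if the recorded digest itself is stored somewhere an attacker cannot modify.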
Data Transfer¶
- Encryption in Transit: Specify TLS 1.3, IPsec configurations.
- Endpoint Security: Detail device verification and certificate pinning.
- API Security: Document authentication, rate-limiting, and channel encryption.
- Data Masking: Use pseudonymisation for sensitive data in transit.
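One common way to pseudonymise identifiers before transfer is keyed hashing. The sketch below uses HMAC-SHA256 from the Python standard library; the function name, the example key, and the example identifiers are hypothetical, and the key would need to be stored and rotated under your key management process.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256 with a secret key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for illustration only; never hard-code real keys.
demo_key = b"demo-key-not-for-production"

token_a = pseudonymise("alice@example.com", demo_key)
token_b = pseudonymise("bob@example.com", demo_key)
print(token_a)
print(token_b)
```

The same input and key always yield the same pseudonym, so records can still be joined after masking. Note that pseudonymised data generally remains personal data under GDPR, because whoever holds the key can re-link it.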
Data Processing¶
- Secure Environments: Use containers, VMs, or trusted execution (e.g., Intel SGX).
- Audit Logs: Specify logging standards, retention, and tamper protection.
- Data Minimisation: Anonymise or limit collected data.
Standards Applied¶
Data post-market monitoring¶
- Data Drift Detection and Monitoring: Describe the type of drift identified (covariate drift, prior probability drift, or concept drift).
- Audit Logs: Periodically perform manual or semi-automated reviews of data samples, and log changes in the data as well as access patterns.
- Action plans implemented to address identified issues:
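As one possible way to monitor covariate drift, the sketch below computes the Population Stability Index (PSI) between a reference sample and a current sample of a single numeric feature. The bin edges, the smoothing constant `eps`, and the sample values are assumptions for illustration; other tests (for example Kolmogorov-Smirnov) may be more appropriate for your data.

```python
import math


def population_stability_index(reference, current, edges, eps=1e-6):
    """PSI between two samples of one numeric feature, using shared bin edges."""

    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                # Bins are half-open [low, high); the last bin also includes its upper edge.
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        if total == 0:
            raise ValueError("No values fall inside the bin edges")
        # Floor each share at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference_sample = [1, 2, 3, 4, 5, 6, 7, 8]
shifted_sample = [5, 6, 7, 8, 6, 7, 7, 8]
bin_edges = [0, 4, 8]

print(population_stability_index(reference_sample, reference_sample, bin_edges))
print(population_stability_index(reference_sample, shifted_sample, bin_edges))
```

A PSI of zero means the binned distributions are identical; larger values indicate a stronger shift and can be logged over time as part of the monitoring record.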
EU Declaration of conformity¶
Standards applied¶
Documentation Metadata¶
Version¶
Template Version¶
Documentation Authors¶
- Name, Team: (Owner / Contributor / Manager)
- Name, Team: (Owner / Contributor / Manager)
- Name, Team: (Owner / Contributor / Manager)