Best Practices for Metadata
What is Metadata?
Metadata is data about data: it provides the context that makes research data understandable and reusable.
It describes the who, what, where, when, how, and why of data collection.

Metadata also informs users about:
- Constraints: limitations on use or sharing
- Update frequency: how often data is refreshed
- Interoperability: standardized terms that make data discoverable across repositories
Metadata CanWIN Collects
- Title
- Summary
- Location
- Date
- Authors and affiliations
- Keywords
- Licensing information and terms of use/access
- Data status and versions
- Update and maintenance frequency
- Data type (dataset, report, model, etc.)
- Sample and analytical methods (steps or methods to collect and process data)
- Instruments & deployment details (instrument type, sensors, deployment dates and locations, etc.)
- Related resources
- Awards & Funding information
- Website
- Theme (marine, atmospheric, freshwater, cryosphere, remote sensing)
- Variable descriptions (names, units, media)
Best Practices for Metadata Submitters (Researchers, Data Providers)
- Use precise keywords: 4–6 well-chosen terms improve findability.
- Include provenance: who collected the data, when, and where.
- Record data collection & processing steps: document methods for transparency and reuse.
- Create a data dictionary: define each variable with its units and a description.
- Keep metadata current: update when methods, instruments, or contributors change.
- Respect Indigenous Data Sovereignty: apply CARE and OCAP principles when relevant.
- Provide licensing & access rights: specify Creative Commons or institutional licenses.
- Write a meaningful dataset description: align with FAIR principles by ensuring the description:
  - Is more than 50 characters long
  - Is written in plain language (accessible to non-specialists)
  - Includes, if possible, scope, purpose, and context (what the data represents, why it was collected)
  - Avoids jargon and unexplained acronyms
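Some of the submitter guidelines above can be checked automatically before a record is submitted. The following is a minimal sketch, not a CanWIN tool: the function name `check_submission`, its parameters, and the acronym heuristic are all assumptions made for illustration. Only the two thresholds (more than 50 characters, 4 to 6 keywords) come from the guidelines.

```python
import re

MIN_DESCRIPTION_CHARS = 50          # guideline: more than 50 characters
MIN_KEYWORDS, MAX_KEYWORDS = 4, 6   # guideline: 4 to 6 well-chosen keywords


def check_submission(description, keywords, explained_acronyms=()):
    """Return a list of warnings; an empty list means no issue was detected."""
    warnings = []
    text = description.strip()

    if len(text) <= MIN_DESCRIPTION_CHARS:
        warnings.append(
            f"Description is {len(text)} characters; "
            f"aim for more than {MIN_DESCRIPTION_CHARS}."
        )

    if not MIN_KEYWORDS <= len(keywords) <= MAX_KEYWORDS:
        warnings.append(
            f"Found {len(keywords)} keywords; "
            f"aim for {MIN_KEYWORDS} to {MAX_KEYWORDS}."
        )

    # Crude heuristic (an assumption): treat any all-caps token of two or more
    # letters as an acronym, and flag it unless the caller marks it as explained.
    acronyms = set(re.findall(r"\b[A-Z]{2,}\b", text)) - set(explained_acronyms)
    for acronym in sorted(acronyms):
        warnings.append(f"Acronym '{acronym}' may need an explanation.")

    return warnings


# A deliberately poor submission: short description, one keyword, bare acronym.
for warning in check_submission("CTD data.", ["arctic"]):
    print(warning)
```

A check like this cannot judge whether a description is meaningful or written in plain language; it only catches the mechanical problems, so a human review is still needed.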
Best Practices for Data Centers / Curators
- Ensure interoperability: export metadata in multiple formats (JSON, RDF/XML, HTML, PDF).
- Apply controlled vocabularies: align with standards (ISO 19115, GCMD keywords).
- Link related resources: connect datasets to publications, instruments, campaigns.
- Assign persistent identifiers (DOIs/Handles): ensure datasets are citable and traceable.
- Maintain update frequency records: indicate whether data is static, ongoing, or regularly updated.
- Validate metadata quality: check for completeness, consistency, and compliance with FAIR principles.
- Provide long-term preservation: ensure metadata remains accessible even if datasets are retired.
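Two of the curator practices above, completeness validation and export, can be illustrated with a short sketch. Everything here is hypothetical: the field names in `REQUIRED_FIELDS` are loosely modelled on the metadata list earlier on this page and are not CanWIN's actual schema, and the record values are placeholders. Only JSON export is shown; the other formats (RDF/XML, HTML, PDF) would need dedicated tooling.

```python
import json

# Hypothetical required fields, not an official schema.
REQUIRED_FIELDS = [
    "title", "summary", "location", "date",
    "authors", "keywords", "license", "update_frequency",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def export_json(record):
    """Serialize the record as readable JSON, keeping non-ASCII text intact."""
    return json.dumps(record, indent=2, ensure_ascii=False, sort_keys=True)


# Placeholder record; note that no license has been provided.
record = {
    "title": "Example CTD profiles, Station 12",
    "summary": "Placeholder summary of temperature, salinity, and oxygen profiles.",
    "location": "Arctic Ocean",
    "date": "2016-07",
    "authors": ["A. Researcher (placeholder)"],
    "keywords": ["arctic", "CTD", "salinity", "temperature"],
    "update_frequency": "static",
}

print(missing_fields(record))  # ['license']
print(export_json(record))
```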
Data Dictionary, Codebooks, and Cookbooks
Beyond metadata fields, three documentation tools strengthen the understandability and reproducibility of your datasets:
Data Dictionary
A data dictionary defines the terms in your data files and applies common names to variables so that your data is understandable to others.
It should include variable names, units, and clear descriptions.
Best practice: Always provide a data dictionary alongside your dataset. This ensures that future users (including yourself!) can interpret variables correctly.
Template (downloadable):
Data Dictionary Template
Data Dictionary Example
| Variable name | Common name | Units | Description |
|---|---|---|---|
| T_C | Temperature | °C | Water temperature measured at depth |
| Salinity | Salinity | PSU | Practical salinity units |
| O2_mgL | Dissolved Oxygen | mg/L | Oxygen concentration in water |
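A data dictionary is most useful when it stays in sync with the data file. As a rough sketch, the example table above can be stored as a Python dictionary and compared against a CSV header to find columns that have no entry. The helper name `undocumented_columns`, the `Depth_m` column, and the sample values are made up for illustration.

```python
import csv
import io

# The example data dictionary above, keyed by variable name.
DATA_DICTIONARY = {
    "T_C": {
        "common_name": "Temperature",
        "units": "°C",
        "description": "Water temperature measured at depth",
    },
    "Salinity": {
        "common_name": "Salinity",
        "units": "PSU",
        "description": "Practical salinity units",
    },
    "O2_mgL": {
        "common_name": "Dissolved Oxygen",
        "units": "mg/L",
        "description": "Oxygen concentration in water",
    },
}


def undocumented_columns(csv_text, dictionary):
    """Return CSV header columns that have no entry in the data dictionary."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [column for column in header if column not in dictionary]


# Made-up sample file with one column missing from the dictionary.
sample = "T_C,Salinity,O2_mgL,Depth_m\n-1.2,32.1,8.4,10\n"
print(undocumented_columns(sample, DATA_DICTIONARY))  # ['Depth_m']
```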
Codebook
A codebook describes the key functions, modules, or scripts used to process the data.
It documents the logic behind transformations, cleaning steps, and analysis routines.
Best practice: Include a codebook whenever scripts are used to process or analyze data. This makes workflows transparent and easier to reproduce.
Template (downloadable):
Codebook Template
Codebook Example
| Function/Module | Purpose | Input | Output | Notes |
|---|---|---|---|---|
| clean_data() | Removes NULL values | raw.csv | clean.csv | Applied before analysis |
| normalize_units() | Standardizes units | clean.csv | norm.csv | Converts to SI units |
| merge_metadata() | Adds location + instrument info | norm.csv + metadata.json | final.csv | Ensures provenance |
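To make the codebook entries above concrete, here is one way the three functions might look. This is a sketch under stated assumptions, not the actual scripts behind the example: it works on in-memory rows (lists of dictionaries) instead of the CSV and JSON files named in the table, and the `depth_ft` to `depth_m` conversion is an invented stand-in for "converts to SI units".

```python
FEET_TO_METRES = 0.3048  # exact by definition


def clean_data(rows):
    """Drop rows that contain a missing (None or empty-string) value."""
    return [
        row for row in rows
        if all(value is not None and value != "" for value in row.values())
    ]


def normalize_units(rows):
    """Convert the hypothetical 'depth_ft' column to 'depth_m' (metres)."""
    normalized = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not modified
        row["depth_m"] = round(row.pop("depth_ft") * FEET_TO_METRES, 3)
        normalized.append(row)
    return normalized


def merge_metadata(rows, metadata):
    """Attach the same metadata fields (e.g. station, instrument) to every row."""
    return [{**row, **metadata} for row in rows]


# Made-up input: the second row has a missing depth and is removed.
raw = [
    {"depth_ft": 10.0, "T_C": -1.2},
    {"depth_ft": None, "T_C": 0.5},
]
final = merge_metadata(normalize_units(clean_data(raw)), {"station": "Station 12"})
print(final)
```

Each function's docstring mirrors the "Purpose" column of the codebook, which keeps the code and its documentation easy to cross-check.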
Cookbook
A cookbook describes the data retrieval and processing steps in a workflow, step by step.
It's essentially a recipe for reproducing your dataset preparation.
Best practice: Provide a cookbook for complex workflows, especially when multiple tools or scripts are involved.
Template (downloadable):
Cookbook Template
Cookbook Example
Dataset Cookbook
1. Data Retrieval
- Instrument: CTD profiler (SeaBird SBE 19+)
- Deployment details: Station 12, Arctic Ocean, July 2016
- Raw file format: .hex converted to .cnv
- Raw variables:
  - TEMP (°C)
  - SAL (PSU)
  - OXYGEN (mg/L)
  - DEPTH (m)
2. Raw → Processed Variables (specific processing for each variable, optional)
| Raw Variable | Processing Step | Final Variable in Data Dictionary |
|---|---|---|
| TEMP | QC filter, converted to UTC timestamps | Temperature (°C) |
| SAL | Flagged values removed, standardized units | Salinity (PSU) |
| OXYGEN | Calibration applied, converted to µmol/kg | Dissolved Oxygen |
| DEPTH | Pressure converted to depth | Depth (m) |
| DEPTH | Pressure converted to depth | Depth (m) |
3. Processing Workflow
- Retrieve raw instrument files from onboard logger.
- Convert proprietary format (.hex) to ASCII (.cnv).
- Apply manufacturer calibration coefficients.
- Remove flagged/bad values.
- Convert timestamps to UTC.
- Standardize units to SI.
- Merge with metadata (station, instrument ID, deployment date).
- Export final dataset as CSV.
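Two steps in this workflow, converting timestamps to UTC and converting dissolved oxygen from mg/L to µmol/kg, can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the sample timestamp is a placeholder, and the seawater density must be supplied by the caller (in practice it depends on temperature, salinity, and pressure). The conversion uses the molar mass of O2, about 31.998 g/mol.

```python
from datetime import datetime, timezone

O2_MOLAR_MASS_G_PER_MOL = 31.998  # molar mass of molecular oxygen (O2)


def to_utc_iso(timestamp_with_offset):
    """Convert an ISO 8601 timestamp that carries a UTC offset to UTC."""
    parsed = datetime.fromisoformat(timestamp_with_offset)
    if parsed.tzinfo is None:
        # Without an offset the local time zone is unknown, so refuse to guess.
        raise ValueError("timestamp has no UTC offset")
    return parsed.astimezone(timezone.utc).isoformat()


def oxygen_mg_per_l_to_umol_per_kg(mg_per_l, density_kg_per_l):
    """Convert dissolved oxygen from mg/L to µmol/kg for a given water density."""
    umol_per_l = mg_per_l / O2_MOLAR_MASS_G_PER_MOL * 1000.0  # mg/L -> µmol/L
    return umol_per_l / density_kg_per_l                      # µmol/L -> µmol/kg


# Placeholder timestamp recorded five hours behind UTC.
print(to_utc_iso("2016-07-15T08:30:00-05:00"))  # 2016-07-15T13:30:00+00:00

# Assumed density of 1.025 kg/L, a typical value for seawater.
print(round(oxygen_mg_per_l_to_umol_per_kg(8.0, 1.025), 1))
```

Recording the density and molar mass used in the cookbook itself lets others reproduce the converted values exactly.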
Tip
Think of these three tools as complementary:
- Data Dictionary: defines your variables
- Codebook: explains your scripts
- Cookbook: documents your workflow
Together, they help make your dataset FAIR and reproducible.