
🗂️ Best Practices for Metadata

📖 What is Metadata?

Metadata is data about data: it provides the context that makes research data understandable and reusable.
It describes the who, what, where, when, how, and why of data collection.

Metadata also informs users about:

  • 🔒 Constraints → limitations on use or sharing
  • 🔄 Update frequency → how often data is refreshed
  • 🌐 Interoperability → standardized terms that make data discoverable across repositories

📋 Metadata CanWIN Collects

CanWIN Metadata

  • Title
  • Summary
  • Location
  • Date
  • Authors and affiliations
  • Keywords
  • Licensing information and terms of use/access
  • Data status and versions
  • Update and maintenance frequency
  • Data type (dataset, report, model, etc.)
  • Sample and analytical methods (steps or methods to collect and process data)
  • Instruments & deployment details (instrument type, sensors, deployment dates and locations, etc.)
  • Related resources
  • Awards & Funding information
  • Website
  • Theme (marine, atmospheric, freshwater, cryosphere, remote sensing)
  • Variable descriptions (names, units, media)

🔬 Best Practices for Metadata Submitters (Researchers, Data Providers)

  • Use precise keywords → 4–6 well-chosen terms improve findability.
  • Include provenance → who collected the data, when, and where.
  • Record data collection & processing steps → document methods for transparency and reuse.
  • Create a data dictionary → define each variable with its units and description.
  • Keep metadata current → update when methods, instruments, or contributors change.
  • Respect Indigenous Data Sovereignty → apply CARE and OCAP principles when relevant.
  • Provide licensing & access rights → specify Creative Commons or institutional licenses.
  • Write a meaningful dataset description → align with FAIR principles by ensuring the description:
    • Is more than 50 characters long
    • Is written in plain language (accessible to non-specialists)
    • Includes, where possible, scope, purpose, and context (what the data represents, why it was collected)
    • Avoids jargon and unexplained acronyms
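
The description and keyword guidance above can be checked automatically before submission. The following is a minimal sketch, not CanWIN's actual validation: the function name `check_submission` and the dictionary keys `description`, `keywords`, and `license` are assumptions made for illustration.

```python
def check_submission(record):
    """Return a list of human-readable problems found in a metadata record.

    `record` is a plain dict; the keys used here are hypothetical,
    not an official CanWIN schema.
    """
    problems = []

    # The guidance asks for a description of more than 50 characters.
    description = record.get("description", "").strip()
    if len(description) <= 50:
        problems.append("Description should be more than 50 characters.")

    # The guidance recommends 4 to 6 well-chosen keywords.
    keywords = record.get("keywords", [])
    if not 4 <= len(keywords) <= 6:
        problems.append("Provide 4 to 6 keywords.")

    # Licensing and access rights should always be stated.
    if not record.get("license"):
        problems.append("Specify a license or access terms.")

    return problems


# Illustrative record with a short description and too few keywords.
record = {
    "description": "CTD profiles from Station 12.",
    "keywords": ["CTD", "salinity"],
    "license": "CC BY 4.0",
}
for problem in check_submission(record):
    print(problem)
```

A check like this only catches length and count problems; plain language and unexplained acronyms still need a human reviewer.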

🏒 Best Practices for Data Centers / Curators

  • Ensure interoperability → export metadata in multiple formats (JSON, RDF/XML, HTML, PDF).
  • Apply controlled vocabularies → align with standards (ISO 19115, GCMD keywords).
  • Link related resources → connect datasets to publications, instruments, campaigns.
  • Assign persistent identifiers (DOIs/Handles) → ensure datasets are citable and traceable.
  • Maintain update frequency records → indicate whether data is static, ongoing, or regularly updated.
  • Validate metadata quality → check for completeness, consistency, and compliance with FAIR principles.
  • Provide long-term preservation → ensure metadata remains accessible even if datasets are retired.
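
As a rough illustration of the completeness check and JSON export mentioned above, the sketch below validates a record against a list of required fields and serializes it. The field names in `REQUIRED_FIELDS` are assumptions loosely based on the CanWIN field list, not an official schema, and the record values are placeholders.

```python
import json

# Hypothetical required fields; a real catalogue schema may use different names.
REQUIRED_FIELDS = ["title", "summary", "location", "date", "authors", "keywords", "license"]


def missing_fields(record):
    """Return the required fields that are absent or empty in `record`."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def export_json(record):
    """Serialize a record to JSON with a stable key order, which makes diffs easier."""
    return json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False)


# Placeholder record; `authors` is empty and `license` is absent on purpose.
record = {
    "title": "Example CTD profiles",
    "summary": "Temperature, salinity, and oxygen profiles from a CTD cast.",
    "location": "Station 12, Arctic Ocean",
    "date": "2016-07",
    "authors": [],
    "keywords": ["CTD", "temperature", "salinity", "dissolved oxygen"],
    "update_frequency": "static",
}

print("Missing:", missing_fields(record))
print(export_json(record))
```

The other export formats listed above (RDF/XML, HTML, PDF) would need their own serializers and are not covered here.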

📑 Data Dictionary, Codebooks, and Cookbooks

Beyond metadata fields, three documentation tools strengthen the understandability and reproducibility of your datasets:


📖 Data Dictionary

A data dictionary defines the terms in your data files and applies common names to variables so that your data is understandable to others.
It should include variable names, units, and clear descriptions.

✅ Best practice: Always provide a data dictionary alongside your dataset. This ensures that future users (including yourself!) can interpret variables correctly.

Template (downloadable):
📥 Data Dictionary Template

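📖 Data Dictionary Example

| Variable name | Common name      | Units | Description                         |
|---------------|------------------|-------|-------------------------------------|
| T_C           | Temperature      | °C    | Water temperature measured at depth |
| Salinity      | Salinity         | PSU   | Practical salinity units            |
| O2_mgL        | Dissolved Oxygen | mg/L  | Oxygen concentration in water       |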
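
A data dictionary is most useful when it is machine-readable, so that a dataset's columns can be checked against it. The sketch below assumes the dictionary is stored as CSV with the hypothetical column headers `variable_name`, `common_name`, `units`, and `description`; the rows come from the example above.

```python
import csv
import io

# Data dictionary rows taken from the example table; the header names are assumptions.
DATA_DICTIONARY_CSV = """variable_name,common_name,units,description
T_C,Temperature,°C,Water temperature measured at depth
Salinity,Salinity,PSU,Practical salinity units
O2_mgL,Dissolved Oxygen,mg/L,Oxygen concentration in water
"""


def load_dictionary(text):
    """Parse the dictionary CSV into a {variable_name: row} mapping."""
    return {row["variable_name"]: row for row in csv.DictReader(io.StringIO(text))}


def undocumented_columns(dataset_columns, dictionary):
    """Return the dataset columns that have no data dictionary entry."""
    return [col for col in dataset_columns if col not in dictionary]


dictionary = load_dictionary(DATA_DICTIONARY_CSV)

# `Depth_m` is a hypothetical extra column, included to show a failing check.
print(undocumented_columns(["T_C", "Salinity", "O2_mgL", "Depth_m"], dictionary))
```

Running a check like this before publishing flags any variable that was added to the dataset but never documented.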

🛠️ Codebook

A codebook describes the key functions, modules, or scripts used to process the data.
It documents the logic behind transformations, cleaning steps, and analysis routines.

✅ Best practice: Include a codebook whenever scripts are used to process or analyze data. This makes workflows transparent and easier to reproduce.

Template (downloadable):
📥 Codebook Template

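🛠️ Codebook Example

| Function/Module   | Purpose                         | Input                    | Output    | Notes                   |
|-------------------|---------------------------------|--------------------------|-----------|-------------------------|
| clean_data()      | Removes NULL values             | raw.csv                  | clean.csv | Applied before analysis |
| normalize_units() | Standardizes units              | clean.csv                | norm.csv  | Converts to SI units    |
| merge_metadata()  | Adds location + instrument info | norm.csv + metadata.json | final.csv | Ensures provenance      |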
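
One way to keep a codebook in sync with the code is to generate it from function docstrings. The sketch below is an assumption-laden illustration: `clean_data()` and `merge_metadata()` are simplified stand-ins that work on in-memory rows rather than the CSV files in the table, `normalize_units()` is omitted because the conversion depends on the source units, and the helper `codebook()` is hypothetical.

```python
import inspect


def clean_data(rows):
    """Removes rows that contain missing (None) values."""
    return [row for row in rows if all(value is not None for value in row.values())]


def merge_metadata(rows, metadata):
    """Adds location and instrument info to every row."""
    return [{**row, **metadata} for row in rows]


def codebook(functions):
    """Build (name, purpose) rows from the first docstring line of each function."""
    return [(fn.__name__ + "()", inspect.getdoc(fn).splitlines()[0]) for fn in functions]


# Print a two-column codebook straight from the code.
for name, purpose in codebook([clean_data, merge_metadata]):
    print(f"{name:<18} {purpose}")

# Illustrative values only; the second row is dropped because T_C is missing.
rows = [{"T_C": 1.2, "Salinity": 34.1}, {"T_C": None, "Salinity": 33.9}]
final = merge_metadata(clean_data(rows), {"station": "Station 12"})
print(final)
```

Inputs, outputs, and notes still need to be written by hand, but generating the purpose column reduces the chance that the codebook drifts from the scripts.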

🍳 Cookbook

A cookbook describes the data retrieval and processing steps in a workflow, step by step.
It's essentially a recipe for reproducing your dataset preparation.

✅ Best practice: Provide a cookbook for complex workflows, especially when multiple tools or scripts are involved.

Template (downloadable):
📥 Cookbook Template

🍳 Cookbook Example

Dataset Cookbook

1. Data Retrieval

  • Instrument: CTD profiler (SeaBird SBE 19+)
  • Deployment details: Station 12, Arctic Ocean, July 2016
  • Raw file format: .hex converted to .cnv
  • Raw variables:
    • TEMP (°C)
    • SAL (PSU)
    • OXYGEN (mg/L)
    • DEPTH (m)

2. Raw → Processed Variables (optional: specific processing for each variable)

| Raw Variable | Processing Step                            | Final Variable in Data Dictionary |
|--------------|--------------------------------------------|-----------------------------------|
| TEMP         | QC filter, converted to UTC timestamps     | Temperature (°C)                  |
| SAL          | Flagged values removed, standardized units | Salinity (PSU)                    |
| OXYGEN       | Calibration applied, converted to µmol/kg  | Dissolved Oxygen                  |
| DEPTH        | Pressure converted to depth                | Depth (m)                         |

3. Processing Workflow

  1. Retrieve raw instrument files from onboard logger.
  2. Convert proprietary format (.hex) to ASCII (.cnv).
  3. Apply manufacturer calibration coefficients.
  4. Remove flagged/bad values.
  5. Convert timestamps to UTC.
  6. Standardize units to SI.
  7. Merge with metadata (station, instrument ID, deployment date).
  8. Export final dataset as CSV.
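
Steps 4, 5, 7, and 8 of the workflow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual CTD processing chain: the records, the `flag` column, the UTC-5 logger offset, and the function names are all hypothetical, and the format conversion and calibration in steps 1 to 3 are assumed to have been done with the manufacturer's tools.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Hypothetical logger time zone and records (values made up for illustration).
LOGGER_TZ = timezone(timedelta(hours=-5))
records = [
    {"time": "2016-07-14 09:30:00", "TEMP": 1.8, "flag": "good"},
    {"time": "2016-07-14 09:31:00", "TEMP": 99.9, "flag": "bad"},
]
metadata = {"station": "Station 12", "instrument": "SBE 19+"}


def to_utc(local_text):
    """Step 5: interpret a logger-local timestamp and convert it to UTC."""
    local = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=LOGGER_TZ)
    return local.astimezone(timezone.utc).isoformat()


def process(records, metadata):
    """Steps 4, 5, and 7: drop flagged rows, convert time, attach metadata."""
    out = []
    for rec in records:
        if rec["flag"] != "good":  # Step 4: remove flagged/bad values
            continue
        out.append({"time_utc": to_utc(rec["time"]), "TEMP": rec["TEMP"], **metadata})
    return out


def to_csv(rows):
    """Step 8: export rows as CSV text (expects at least one row)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


final_rows = process(records, metadata)
print(to_csv(final_rows))
```

Keeping each workflow step in its own small function makes it easy to list the same steps, in the same order, in the cookbook.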

Tip

Think of these three tools as complementary:

  • Data Dictionary → defines your variables
  • Codebook → explains your scripts
  • Cookbook → documents your workflow

Together, they help make your dataset FAIR and reproducible.

📚 References & Extra Sources