🗂️ Best Practices for Metadata

📖 What is Metadata?

Metadata is data about data — it provides the context that makes research data understandable and reusable.
It describes the who, what, where, when, how, and why of data collection.

Metadata also informs users about:

🔒 Constraints → limitations on use or sharing
🔄 Update frequency → how often data is refreshed
🌐 Interoperability → standardized terms that make data discoverable across repositories

📋 Metadata CanWIN Collects

CanWIN Metadata

Mandatory metadata

Title
Summary
Location
Date
Authors and affiliations
Keywords
Licensing information and terms of use/access
Data status and versions
Update and maintenance frequency
Data type (dataset, report, model, etc.)

Recommended metadata

Sample and analytical methods (steps or methods used to collect and process data)
Instruments and deployment details (instrument type, sensors, deployment dates and locations, etc.)
Related resources
Awards and funding information
Website
Theme (marine, atmospheric, freshwater, cryosphere, remote sensing)
Variable descriptions (names, units, media)
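
The mandatory fields can be checked mechanically before a record is submitted. The following is a minimal sketch, assuming the record is held as a Python dictionary. The key names (`title`, `summary`, `license`, and so on) are illustrative and are not CanWIN's actual schema.

```python
# Hypothetical key names mirroring the mandatory fields listed above;
# CanWIN's real schema may use different identifiers.
MANDATORY_FIELDS = [
    "title", "summary", "location", "date", "authors",
    "keywords", "license", "status", "update_frequency", "data_type",
]


def missing_mandatory(record: dict) -> list[str]:
    """Return the mandatory fields that are absent or empty in a record."""
    missing = []
    for field in MANDATORY_FIELDS:
        value = record.get(field)
        # Treat None, empty strings, and empty lists as missing.
        if value is None or (hasattr(value, "__len__") and len(value) == 0):
            missing.append(field)
    return missing


# Placeholder values for illustration only.
record = {
    "title": "Example CTD profiles",
    "summary": "Temperature, salinity, and oxygen profiles from one station.",
    "location": "Arctic Ocean",
    "date": "2016-07",
    "authors": [{"name": "A. Researcher", "affiliation": "Example University"}],
    "keywords": ["CTD", "temperature", "salinity", "oxygen"],
    "license": "",  # left blank on purpose to show the check
}

print(missing_mandatory(record))
# ['license', 'status', 'update_frequency', 'data_type']
```

Returning the list of gaps, rather than a simple pass/fail, tells the submitter exactly what still needs to be filled in.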

🔬 Best Practices for Metadata Submitters (Researchers, Data Providers)

Use precise keywords → 4–6 well‑chosen terms improve findability.
Include provenance → who collected the data, when, and where.
Record data collection & processing steps → document methods for transparency and reuse.
Create a data dictionary → define each variable with its units and a description.
Keep metadata current → update when methods, instruments, or contributors change.
Respect Indigenous Data Sovereignty → apply CARE and OCAP principles when relevant.
Provide licensing & access rights → specify Creative Commons or institutional licenses.
Write a meaningful dataset description → align with FAIR principles by ensuring the description:
- Is more than 50 characters long
- Is written in plain language (accessible to non‑specialists)
- Includes, if possible, scope, purpose, and context (what the data represents and why it was collected)
- Avoids jargon and unexplained acronyms
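
Some of these checks can be automated as a first pass before human review. The sketch below assumes two hypothetical helpers, `check_description` and `check_keywords`. The acronym test is a rough heuristic (any run of two or more capital letters) and cannot judge plain language or context, which still need a reader.

```python
import re


def check_description(text: str, known_acronyms: set[str]) -> list[str]:
    """Return a list of problems found in a dataset description."""
    problems = []
    if len(text.strip()) <= 50:
        problems.append("description should be longer than 50 characters")
    # Runs of two or more capital letters are treated as acronyms (heuristic).
    for acronym in sorted(set(re.findall(r"\b[A-Z]{2,}\b", text))):
        # An acronym counts as explained if it appears in parentheses,
        # e.g. "(CTD)", or if the caller lists it as widely known.
        if acronym not in known_acronyms and f"({acronym})" not in text:
            problems.append(f"unexplained acronym: {acronym}")
    return problems


def check_keywords(keywords: list[str]) -> list[str]:
    """Flag keyword lists outside the suggested range of 4 to 6 terms."""
    if 4 <= len(keywords) <= 6:
        return []
    return [f"expected 4-6 keywords, found {len(keywords)}"]


good = ("Conductivity, temperature, and depth (CTD) profiles "
        "collected to monitor water properties.")
print(check_description(good, set()))         # []
print(check_description("CTD data.", set()))  # two problems reported
print(check_keywords(["ice", "ocean"]))
```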

🏢 Best Practices for Data Centers / Curators

Ensure interoperability → export metadata in multiple formats (JSON, RDF/XML, HTML, PDF).
Apply controlled vocabularies → align with standards (ISO 19115, GCMD keywords).
Link related resources → connect datasets to publications, instruments, campaigns.
Assign persistent identifiers (DOIs/Handles) → ensure datasets are citable and traceable.
Maintain update frequency records → indicate whether data is static, ongoing, or regularly updated.
Validate metadata quality → check for completeness, consistency, and compliance with FAIR principles.
Provide long‑term preservation → ensure metadata remains accessible even if datasets are retired.
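
Exporting one record in several formats can be as simple as rendering the same dictionary in different ways. This is a minimal sketch covering JSON and HTML only, using the Python standard library. The field names and the identifier value are placeholders, and a production export (RDF/XML, PDF, ISO 19115) would need dedicated tooling.

```python
import html
import json


def to_json(record: dict) -> str:
    """Serialize a metadata record as indented JSON with sorted keys."""
    return json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False)


def to_html(record: dict) -> str:
    """Render a metadata record as a simple HTML definition list."""
    rows = []
    for key, value in record.items():
        # Join list values for display; escape everything for safe HTML.
        text = ", ".join(map(str, value)) if isinstance(value, list) else str(value)
        rows.append(f"  <dt>{html.escape(key)}</dt><dd>{html.escape(text)}</dd>")
    return "<dl>\n" + "\n".join(rows) + "\n</dl>"


# Placeholder record; "PLACEHOLDER-ID" stands in for a real DOI or Handle.
record = {
    "title": "Example CTD profiles",
    "identifier": "PLACEHOLDER-ID",
    "update_frequency": "static",
    "keywords": ["CTD", "salinity"],
}

print(to_json(record))
print(to_html(record))
```

Keeping a single source record and deriving every format from it avoids the formats drifting out of sync.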

📑 Data Dictionary, Codebooks, and Cookbooks

Beyond metadata fields, three documentation tools strengthen the understandability and reproducibility of your datasets:

📖 Data Dictionary

A data dictionary defines the terms in your data files and applies common names to variables so that your data is understandable to others.
It should include variable names, units, and clear descriptions.

✅ Best practice: Always provide a data dictionary alongside your dataset. This ensures that future users (including yourself!) can interpret variables correctly.

Template (downloadable):
📥 Data Dictionary Template

📖 Data Dictionary Example

Variable name	Common name	Units	Description
T_C	Temperature	°C	Water temperature measured at depth
Salinity	Salinity	PSU	Practical salinity units
O2_mgL	Dissolved Oxygen	mg/L	Oxygen concentration in water
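
A data dictionary kept as a tab-separated file can also be used to confirm that every column in a data file is documented. The sketch below loads the example above and checks a file header against it. `Depth_m` is a hypothetical extra column added to show the check failing.

```python
import csv
import io

# The example data dictionary above, as tab-separated text.
DICTIONARY_TSV = (
    "Variable name\tCommon name\tUnits\tDescription\n"
    "T_C\tTemperature\t°C\tWater temperature measured at depth\n"
    "Salinity\tSalinity\tPSU\tPractical salinity units\n"
    "O2_mgL\tDissolved Oxygen\tmg/L\tOxygen concentration in water\n"
)


def load_dictionary(text: str) -> dict[str, dict[str, str]]:
    """Parse the dictionary into {variable name: row}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Variable name"]: row for row in reader}


def undocumented_columns(header: list[str], dictionary: dict) -> list[str]:
    """Return data-file columns that have no data dictionary entry."""
    return [name for name in header if name not in dictionary]


dictionary = load_dictionary(DICTIONARY_TSV)
print(dictionary["T_C"]["Units"])  # °C
print(undocumented_columns(["T_C", "Salinity", "O2_mgL", "Depth_m"], dictionary))
# ['Depth_m']
```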

🛠️ Codebook

A codebook describes the key functions, modules, or scripts used to process the data.
It documents the logic behind transformations, cleaning steps, and analysis routines.

✅ Best practice: Include a codebook whenever scripts are used to process or analyze data. This makes workflows transparent and easier to reproduce.

Template (downloadable):
📥 Codebook Template

🛠️ Codebook Example

Function/Module	Purpose	Input	Output	Notes
clean_data()	Removes NULL values	raw.csv	clean.csv	Applied before analysis
normalize_units()	Standardizes units	clean.csv	norm.csv	Converts to SI units
merge_metadata()	Adds location + instrument info	norm.csv + metadata.json	final.csv	Ensures provenance
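
To make the table concrete, here is one way the three functions could look. This is an illustrative sketch only: the function names come from the example above, but the bodies, the in-memory rows (instead of CSV files), and the `depth_cm` to `depth_m` conversion are assumptions chosen to keep the example short.

```python
def clean_data(rows: list[dict]) -> list[dict]:
    """Drop any row that contains a missing (None) value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def normalize_units(rows: list[dict]) -> list[dict]:
    """Hypothetical unit step: convert depth from centimetres to metres."""
    normalized = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != "depth_cm"}
        new_row["depth_m"] = row["depth_cm"] / 100.0
        normalized.append(new_row)
    return normalized


def merge_metadata(rows: list[dict], metadata: dict) -> list[dict]:
    """Attach the same provenance fields (station, instrument) to every row."""
    return [{**row, **metadata} for row in rows]


raw = [
    {"temp_c": 1.5, "depth_cm": 250},
    {"temp_c": None, "depth_cm": 300},  # dropped by clean_data()
]
metadata = {"station": "Station 12", "instrument": "CTD profiler"}

final = merge_metadata(normalize_units(clean_data(raw)), metadata)
print(final)
```

Each function takes rows and returns new rows without modifying its input, so the order of steps in the codebook maps directly onto the order of calls.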

🍳 Cookbook

A cookbook describes the data retrieval and processing steps in a workflow, step by step.
It’s essentially a recipe for reproducing your dataset preparation.

✅ Best practice: Provide a cookbook for complex workflows, especially when multiple tools or scripts are involved.

Template (downloadable):
📥 Cookbook Template

🍳 Cookbook Example

Dataset Cookbook

1. Data Retrieval

Instrument: CTD profiler (SeaBird SBE 19+)
Deployment details: Station 12, Arctic Ocean, July 2016
Raw file format: .hex converted to .cnv
Raw variables:
- TEMP (°C)
- SAL (PSU)
- OXYGEN (mg/L)
- DEPTH (m)

2. Raw → Processed Variables (optional: specific processing for each variable)

Raw Variable	Processing Step	Final Variable in Data Dictionary
TEMP	QC filter applied; timestamps converted to UTC	Temperature (°C)
SAL	Flagged values removed, standardized units	Salinity (PSU)
OXYGEN	Calibration applied, converted to µmol/kg	Dissolved Oxygen (µmol/kg)
DEPTH	Pressure converted to depth	Depth (m)
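
Unit conversions like the oxygen step are worth writing out in the cookbook so others can reproduce them. A minimal sketch of the mg/L to µmol/kg conversion follows. It uses the molar mass of O2 (31.998 g/mol) and an assumed seawater density of 1.025 kg/L. In practice the density should be computed from the measured temperature, salinity, and pressure rather than fixed.

```python
O2_MOLAR_MASS_G_PER_MOL = 31.998  # molar mass of molecular oxygen (O2)


def oxygen_mg_per_l_to_umol_per_kg(
    mg_per_l: float, density_kg_per_l: float = 1.025
) -> float:
    """Convert dissolved oxygen from mg/L to µmol/kg.

    The default density is an assumed typical seawater value, not a
    measured one; pass the in-situ density when it is available.
    """
    # mg/L divided by g/mol gives mmol/L; multiply by 1000 for µmol/L.
    umol_per_l = mg_per_l / O2_MOLAR_MASS_G_PER_MOL * 1000.0
    # Dividing by density (kg/L) changes the basis from volume to mass.
    return umol_per_l / density_kg_per_l


print(round(oxygen_mg_per_l_to_umol_per_kg(8.0), 2))  # 243.92
```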

3. Processing Workflow

Retrieve raw instrument files from onboard logger.
Convert proprietary format (.hex) to ASCII (.cnv).
Apply manufacturer calibration coefficients.
Remove flagged/bad values.
Convert timestamps to UTC.
Standardize units to SI.
Merge with metadata (station, instrument ID, deployment date).
Export final dataset as CSV.
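
The later steps of this workflow (removing flagged values, converting timestamps to UTC, merging metadata, and exporting CSV) can be sketched in a few lines. The column names, the flag convention (0 means good), and the fixed UTC offset are assumptions for illustration. Format conversion and calibration depend on the manufacturer's software and are not shown.

```python
import csv
import io
from datetime import datetime, timedelta, timezone


def to_utc(local_iso: str, utc_offset_hours: float) -> str:
    """Convert a naive local ISO timestamp to an ISO 8601 UTC string."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromisoformat(local_iso).replace(tzinfo=local_tz)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def process(rows: list[dict], utc_offset_hours: float, metadata: dict) -> list[dict]:
    """Drop flagged rows, convert time to UTC, and attach metadata."""
    processed = []
    for row in rows:
        if row["flag"] != 0:  # assumed convention: 0 = good, non-zero = bad
            continue
        processed.append({
            "time_utc": to_utc(row["time_local"], utc_offset_hours),
            "temp_c": row["temp_c"],
            **metadata,
        })
    return processed


def to_csv(rows: list[dict]) -> str:
    """Write rows to CSV text; returns an empty string for no rows."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only; the local clock is assumed to be UTC-5.
raw = [
    {"time_local": "2016-07-15T08:00:00", "temp_c": -1.2, "flag": 0},
    {"time_local": "2016-07-15T08:00:01", "temp_c": 99.0, "flag": 1},
]
print(to_csv(process(raw, -5, {"station": "Station 12"})))
```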

Tip

Think of these three tools as complementary:

- Data Dictionary → defines your variables
- Codebook → explains your scripts
- Cookbook → documents your workflow

Together, they make your dataset FAIR and reproducible.