FAQ: Metadata

Frequently asked questions about the University of Helsinki Data catalogue, metadata and preservation of research data.

What is metadata?

Metadata, or descriptive data, contain information about the data, as the name implies. Typical metadata include the name of the data, the creators, the time of creation, the data type, descriptions of the variables used and the software that may be needed to open the data. Metadata can be divided into metadata that support the discoverability of the data and metadata that support the understandability or reuse of the data. Examples of metadata that support discoverability include the name of the creator, the discipline and keywords describing the data. Metadata supporting the reuse of data include explanations of the variables used and information on how the data was collected.

Comprehensive metadata on research data are of crucial importance for the reuse of the data. 
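
The division between discoverability and reuse metadata can be sketched as a small structured record. This is a minimal illustration only: the field names, the values and the `missing_fields` helper are assumptions made for this example and do not represent the schema of the Data catalogue or of any other repository.

```python
import json

# Illustrative record only: the field names and values are assumptions,
# not the schema of any particular data catalogue.
metadata = {
    "discoverability": {
        "title": "The impact of climate change on biodiversity in Greenland 1990-2020",
        "creators": ["Example Researcher"],
        "discipline": "Ecology",
        "keywords": ["climate change", "biodiversity", "Greenland"],
    },
    "reuse": {
        "variables": {
            "species_count": "Number of species observed per site and year",
            "mean_temp_c": "Annual mean air temperature in degrees Celsius",
        },
        "collection_method": "Placeholder: describe how the data was collected",
        "software": "Placeholder: software needed to open the files",
    },
}


def missing_fields(section, required):
    """Return the required field names that are absent or empty in a section."""
    return [name for name in required if not section.get(name)]


if __name__ == "__main__":
    # Check the discoverability part, then show the whole record as JSON.
    print(missing_fields(metadata["discoverability"],
                         ["title", "creators", "discipline", "keywords"]))
    print(json.dumps(metadata, indent=2))
```

Keeping the two groups separate makes it easy to check each one on its own, for example before publishing a description.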

What is a good title for research data?

A good title is clear, informative and identifies the data. It gives an immediate idea of what the data contains and what kind of research it relates to. Here are some suggestions for writing a good title:

  1. Be precise and explicit: Try to describe the content of the data as accurately as possible. Avoid general and vague expressions.
  2. Use keywords: Include key terms or keywords that describe the subject of the data and can help others to find the data through search engines.  
  3. Mention the time period and geographical area: If the data relates to a specific time period or geographical area, include this information in the title.
  4. Avoid abbreviations and technical jargon: Use terms that are understandable to a general audience, unless the data is intended for a specific or professional audience.
  5. Keep the title concise: Aim for a title that is short and to the point, but still informative enough.

Examples of titles that follow these guidelines:

  • Physical activity among Finnish adolescents in 2020-2022
  • The impact of education on employment: a longitudinal study in Finland 2010-2020
  • The impact of climate change on biodiversity in Greenland 1990-2020

Avoid titles that say little about the data itself, such as "All data up to 2000" or "E. coli measurements".

What is a good description/abstract for research data?

Key points in the abstract for research data:

  1. Content of the research data
    • What is the research data about?
    • The main variables, themes or phenomena covered by the data.
    • Size and structure of the data (e.g., quantitative vs. qualitative data, file types).
  2. Research methods and data collection
    • How was the data collected (e.g., questionnaires, interviews, sensors, modelling)?
    • Temporal and geographical coverage of the data.
    • Equipment, software or data sources used.
  3. Purpose and relevance
    • Why was the data collected?
    • What research questions can it answer?
    • Any restrictions or special considerations on the use of the data.
  4. Format and accessibility of the data
    • In what format is the data available (e.g., CSV, JSON, image data)?
    • Is the data openly available or restricted (e.g., access by request)?
    • Citation of original sources and any additional resources.
  5. Any ethical or legal considerations
    • Does the material contain personal data or sensitive information?
    • Is the data anonymised?
    • Are specific permissions required?

We have selected a few examples of research metadata by type of data. You can model your abstract on them or on the concise example below.

A good example of a research data abstract:

This dataset contains air pollution measurements collected in Helsinki in 2023. The dataset consists of PM2.5 and PM10 concentrations measured at 15 different stations at hourly intervals from 1 January to 31 December 2023. The data was collected by the sensor devices of the air quality measurement network of the Helsinki Metropolitan Area and is available in CSV and JSON formats. The data can be used to analyse air quality trends and for urban planning. The use of the data is open, but a reference to the original source is mandatory.

What makes a good README file?

A good README file provides the key information needed to reuse the data. We have created a template README file based on imaginary data. You can download and modify it to suit your own use. Additional information may include the software used for opening the files, the data collection methods and instruments, the number of observations and variables, the type of measuring instrument used and its manufacturer.
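
As a rough sketch of how such information could be collected into a plain-text README, the example below renders a file from a handful of fields. It is not the template README mentioned above: the section headings, the `SECTIONS` list and the `render_readme` function are hypothetical and only mirror the kinds of information listed in the previous paragraph.

```python
# Hypothetical section list, based on the kinds of information named above.
SECTIONS = [
    ("Software needed to open the files", "software"),
    ("Data collection methods and instruments", "methods"),
    ("Number of observations", "observations"),
    ("Number of variables", "variables"),
    ("Measuring instrument and manufacturer", "instrument"),
]


def render_readme(title, info):
    """Build a plain-text README; fields that are missing are flagged visibly."""
    lines = [title, "=" * len(title), ""]
    for heading, key in SECTIONS:
        value = info.get(key, "NOT PROVIDED")
        lines.append(f"{heading}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Imaginary values for illustration only.
    print(render_readme("Example dataset",
                        {"software": "Any CSV reader",
                         "observations": 1200,
                         "variables": 8}))
```

Flagging missing fields as "NOT PROVIDED" instead of leaving them out makes gaps in the documentation easy to spot before the data is shared.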

What does provenance mean?

Provenance refers to the history of the creation and modification of the data. Provenance information should cover, for example, how the data has been modified or corrected, split into parts, or combined with other datasets.

Data provenance information can include information like…

Data Creation & Source Information

  • Origin: Who created or collected the data? (e.g., researcher, institution, automated system)
  • Collection Date & Time: When was the data collected/generated?
  • Data Sources: If the dataset is derived from other sources, list them with citations.

Data Processing & Transformation

  • Processing Steps: What modifications, cleaning, or transformations were applied?
  • Software & Tools: Any tools, scripts, or software used for data processing (including versions).
  • Intermediate Data: If applicable, describe intermediate datasets created before the final version.

Data Contributors & Roles

  • Roles & Responsibilities: Define contributions, e.g., who curated, analyzed, or published the data.

Data Changes

  • Version Number: Identify the version of the dataset (e.g., v1.0, v2.1).
  • Change History: Document modifications, corrections, or updates to the dataset.
  • Timestamps for Changes: When were updates made?
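
The items above can be kept as a simple machine-readable record alongside the data. The following is a minimal sketch under stated assumptions: the `Provenance` and `Change` classes and their field names are invented for this example and do not follow any formal provenance standard.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class Change:
    version: str      # e.g. "v1.0"
    description: str  # what was modified, corrected or updated
    timestamp: str    # ISO 8601 time of the change


@dataclass
class Provenance:
    origin: str                                   # who created or collected the data
    collected: str                                # when the data was collected
    sources: list = field(default_factory=list)   # datasets this one is derived from
    tools: list = field(default_factory=list)     # software and versions used
    changes: list = field(default_factory=list)   # change history, oldest first

    def record_change(self, version, description, when=None):
        """Append a change entry; defaults to the current UTC time."""
        when = when or datetime.now(timezone.utc)
        self.changes.append(Change(version, description, when.isoformat()))

    def current_version(self):
        """Return the latest recorded version, or None if nothing is recorded."""
        return self.changes[-1].version if self.changes else None


if __name__ == "__main__":
    # Imaginary values for illustration only.
    prov = Provenance(origin="Example research group",
                      collected="2023-01-01/2023-12-31")
    prov.record_change("v1.0", "Initial release")
    print(prov.current_version())
    print(asdict(prov))
```

Recording each change with a version number and a timestamp keeps the change history in one place, and `asdict` turns the record into a plain dictionary that can be saved with the dataset.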

What does restricted access mean?

Restricted access to research data means that the data is not freely available to everyone. For example, it cannot be downloaded directly from a repository; instead, access must be requested. There are usually restrictions on the use and sharing of such data. These restrictions may exist for a number of reasons, such as:

  1. Data protection: If the data contains personal or sensitive information, access to it should be restricted to protect the privacy of the subjects.
  2. Ethical reasons: Access to research data may also be restricted because of other sensitive elements contained in the data, for example information related to biosafety or the occurrence of endangered species.
  3. Contractual or commercial interests: Access to data may also be restricted by contract in some circumstances. This is often linked to the commercial value of the data.

Restricted access does not automatically mean that data cannot be made available under any circumstances. It only means that access to the data must be requested. Repositories usually have a straightforward process for this, which includes explaining why the data is being requested and what it will be used for.