Open Science

A guide to open science principles.

Open Science in the Data Collection Stage

During the data collection or data creation stage of a research project, it's important that researchers practice data management to not only improve organization and workflows but to ensure the integrity of their results.  When data are easier to find, understand, and navigate, a research project can more easily be shared and reproduced.  This page covers suggestions for file naming conventions, stable file formats, and file hierarchies.

For a more comprehensive overview, please visit our Research Data Management Guide:

File Naming Conventions

A file naming convention is a standard framework for naming your files in a way that describes them accurately and consistently.  Establishing a file naming convention before starting a project can improve organization and accessibility for your research team.

Infographic describing file naming conventions

Image created using Canva.com


Date Formatting
By choosing a standard format for dates, you can avoid confusion and errors when naming files.  The ISO 8601 date format is an international standard for representing dates and times. It orders the parts of a date as year, month, and day, so file names are unambiguous and sort chronologically:


Example: 2021-05-02 (YYYY-MM-DD) or 20210502 (YYYYMMDD)

Follow this link for more information about ISO date formatting: https://www.iso.org/iso-8601-date-and-time-format.html 
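As a minimal sketch, Python's standard datetime module can produce both date forms shown above. The file name stem "survey_results" is only an illustrative placeholder, not a required name.

```python
from datetime import date

# A fixed date keeps the output reproducible; date.today() would be used in practice.
collected = date(2021, 5, 2)

extended = collected.isoformat()       # year-month-day with hyphens
basic = collected.strftime("%Y%m%d")   # same date without separators

# Hypothetical file name that starts with the date so files sort chronologically.
filename = f"{basic}_survey_results.csv"

print(extended)  # 2021-05-02
print(basic)     # 20210502
print(filename)  # 20210502_survey_results.csv
```

Putting the date first in this form means an alphabetical sort of the folder is also a chronological sort.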


Standard Characters
Only standard, alphanumeric characters should be used in file names. It’s good practice to: 

  • Avoid special characters such as !, #, &, and *. These can affect how a file is displayed or handled when it is moved between operating systems.
  • Avoid starting or ending your file name with a non-alphanumeric character such as a hyphen or period.

Example: 20180502_survey_results.csv, rather than 201805.survey.results.csv

  • Use underscores or capital letters to separate words in your file name:

CamelCase                         | Pot_hole_case
--------------------------------- | ------------------------------------
ShovelTestSample002.csv           | shovel_test_sample_002.csv
20240715_TissueScanSample005.tiff | 20240715_tissue_scan_sample_005.tiff
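The character guidelines above can be checked automatically. The sketch below is one possible approach in Python, not an official tool: the function names are hypothetical, and it assumes that hyphens, underscores, and periods are acceptable inside a name as long as the part before the extension starts and ends with a letter or digit.

```python
import re

# Assumption: letters, digits, underscores, hyphens, and periods are allowed.
ALLOWED = re.compile(r"[A-Za-z0-9._-]+")

def uses_standard_characters(filename: str) -> bool:
    """Return True if the name has no special characters and the part
    before the extension starts and ends with a letter or digit."""
    stem = filename.rsplit(".", 1)[0]
    if not stem or not ALLOWED.fullmatch(filename):
        return False
    return stem[0].isalnum() and stem[-1].isalnum()

def to_underscore_case(stem: str) -> str:
    """Convert a CamelCase stem to lowercase words separated by underscores."""
    # Underscore between a lowercase letter or digit and a following capital.
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", stem)
    # Underscore between a letter and a following digit.
    s = re.sub(r"(?<=[A-Za-z])(?=[0-9])", "_", s)
    return s.lower()

print(uses_standard_characters("20180502_survey_results.csv"))  # True
print(uses_standard_characters("results#final.csv"))            # False
print(to_underscore_case("ShovelTestSample002"))                # shovel_test_sample_002
```

Whichever style your team picks, applying it with a small helper like this tends to be more consistent than renaming files by hand.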

Sequential Ordering

When using numbers in a file name to designate an order, use leading zeros for consistency and better readability.  Without them, most systems sort names character by character, so 10 is listed before 2.  Two digits (01) keep up to 99 files in order, and three digits (001) keep up to 999.
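The effect of leading zeros on sort order can be seen in a short Python sketch. The "sample_" prefix is only an illustrative placeholder.

```python
numbers = (1, 2, 10)

# Without leading zeros, names sort character by character.
unpadded = sorted(f"sample_{n}.csv" for n in numbers)

# With three digits, alphabetical order matches numeric order (up to 999).
padded = sorted(f"sample_{n:03d}.csv" for n in numbers)

print(unpadded)  # ['sample_1.csv', 'sample_10.csv', 'sample_2.csv']
print(padded)    # ['sample_001.csv', 'sample_002.csv', 'sample_010.csv']
```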

File Directory Structure Conventions

Organizing your data folders into a clear directory structure makes it easier to locate files and keep track of versions.  Evaluate the best hierarchy for organizing your files and decide whether a deep or shallow hierarchy better suits your needs. If your team has multiple independent data collections, create a distinct folder for each one.

Directory top-level folders should include the project title, a unique identifier, and the date (year).  The substructure should have a clear, consistent naming convention, e.g., uniform conventions for labeling each run of an experiment, each version of a dataset, and/or each person in the group.
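One way to keep a structure like this consistent is to create it with a script rather than by hand. The following Python sketch is only an illustration: the project title "SoilSurvey", the identifier "P0042", and the subfolder names are hypothetical, and it builds the folders inside a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

def build_project_tree(root: Path, title: str, project_id: str,
                       year: int, runs: int) -> Path:
    """Create a top-level folder named title_identifier_year with a
    consistently named substructure, and return the top-level path."""
    top = root / f"{title}_{project_id}_{year}"
    # Hypothetical subfolders; adapt these to your own project.
    for sub in ("raw_data", "processed_data", "documentation"):
        (top / sub).mkdir(parents=True, exist_ok=True)
    # One folder per experimental run, numbered with leading zeros.
    for n in range(1, runs + 1):
        (top / "raw_data" / f"run_{n:02d}").mkdir(exist_ok=True)
    return top

with tempfile.TemporaryDirectory() as tmp:
    top = build_project_tree(Path(tmp), "SoilSurvey", "P0042", 2024, runs=3)
    created = sorted(p.relative_to(top).as_posix() for p in top.rglob("*"))

print(top.name)  # SoilSurvey_P0042_2024
print(created)
```

In a real project, the temporary directory would be replaced with your shared drive or project folder.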


Sample File Directory Screenshot

File Formats to Avoid Obsolescence

As technology changes, so too do the ways researchers can access and use data.  This includes the ever-changing file formats of proprietary software.  To increase the longevity of your data, use file formats that are likely to remain accessible for the foreseeable future.

Obsolescence-resistant file formats are typically:

  • Non-proprietary
  • Open, documented standards
  • Commonly used by the research community
  • Standard representations (e.g., ASCII or Unicode)
  • Unencrypted
  • Uncompressed

Examples of these formats are:

  • PDF or RTF (not Word)
  • ASCII or CSV (not Excel)
  • MPEG-4 (not QuickTime)
  • TIFF or JPEG2000 (not GIF or JPG)
  • XML or RDF (not RDBMS)

Image by Esteban.alej from Wikimedia Commons

For a more exhaustive list of recommended file formats to avoid obsolescence, visit -

More Examples of Stable File Formats

 

Type of Data | Stable File Format Examples
------------ | ------------------------------
Text         | ASCII, XML, PDF/A, HTML, UTF-8
Tabular Data | CSV
Still Images | TIFF, JPEG, PDF, PNG, GIF, BMP
Geospatial   | SHP, DBF, GeoTIFF, NetCDF
Databases    | XML, CSV

Data Versioning

It can be both useful and necessary to retain different versions of datasets as they are transformed.  For example, data may need re-processing to include new calculations; errors may need to be corrected; or new data might need to be generated and added to the dataset.


Instead of editing the original file, which may risk irreversible loss of the raw data if an error is made during the overwriting process, researchers can create versions of datasets as changes are made.


To keep track of older data versions, add a version number to the file name each time the data are changed.

For example, v2, v3, etc.:

  • 20240325_MultiAnalysis_Submission_v2.csv
  • HarrisStudy_Survey002_20220603_v3.1.csv
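The idea of saving a new version instead of overwriting the original can be sketched in Python. This is an illustration under stated assumptions, not a prescribed workflow: the function names are hypothetical, a file with no version suffix is treated as version 1, and a minor version such as v3.1 is rounded up to the next whole number (v4).

```python
import re
import shutil
import tempfile
from pathlib import Path

# Matches a trailing _v2 or _v3.1 at the end of the file stem.
VERSION_SUFFIX = re.compile(r"_v(\d+)(?:\.\d+)?$")

def next_version_path(path: Path) -> Path:
    """Return the path for the next whole-number version of a file."""
    match = VERSION_SUFFIX.search(path.stem)
    if match:
        number = int(match.group(1)) + 1
        stem = path.stem[: match.start()]
    else:
        number = 2  # assumption: an unversioned file counts as version 1
        stem = path.stem
    return path.with_name(f"{stem}_v{number}{path.suffix}")

def save_new_version(path: Path) -> Path:
    """Copy the file to its next version name, leaving the original untouched."""
    new_path = next_version_path(path)
    if new_path.exists():
        raise FileExistsError(new_path)
    shutil.copy2(path, new_path)
    return new_path

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "20240325_MultiAnalysis_Submission_v2.csv"
    original.write_text("id,value\n1,10\n")
    copy_name = save_new_version(original).name
    original_still_exists = original.exists()

print(copy_name)  # 20240325_MultiAnalysis_Submission_v3.csv
```

Edits are then made to the new copy, so the earlier version remains available if an error is found later.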

Data Versioning Control Tables

Creating a control table can keep track of different versions of your files and help you and your team document changes.  A control table describes which versions of the document were created, what the change was, who made it, and when.  Consider creating a README file or spreadsheet listing the data versions to include with your dataset prior to storage. See below for an example:

Version | Author | Change or Purpose                  | Date
------- | ------ | ---------------------------------- | --------
1.1     | TCN    | Corrected formula in column 7      | 20240427
1.2     | RD     | Amended references 3 and 4         | 20240430
1.3     | JRW    | Formatted results table            | 20240515
2.0     | MJM    | Added statistical analysis section | 20240529
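A control table like the one above can also be kept as a CSV file alongside the dataset. The Python sketch below shows one way to generate it with the standard csv module, using the first two rows of the example table. It writes to an in-memory buffer for illustration; in practice the same writer could be pointed at a file with a name of your choosing.

```python
import csv
import io

FIELDS = ["Version", "Author", "Change or Purpose", "Date"]

# Rows taken from the example control table above.
rows = [
    {"Version": "1.1", "Author": "TCN",
     "Change or Purpose": "Corrected formula in column 7", "Date": "20240427"},
    {"Version": "1.2", "Author": "RD",
     "Change or Purpose": "Amended references 3 and 4", "Date": "20240430"},
]

# An in-memory buffer stands in for a real file in this sketch.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

table_text = buffer.getvalue()
print(table_text)
```

Storing the table as CSV keeps it in one of the stable formats recommended earlier in this guide.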