Open Science in the Data Collection Stage
During the data collection or data creation stage of a research project, researchers should practice data management not only to improve organization and workflows but also to ensure the integrity of their results. When data are easier to find, understand, and navigate, a research project can more easily be shared and reproduced. This page covers suggestions for file naming conventions, stable file formats, and file hierarchies.
For a more comprehensive overview, please visit our Research Data Management Guide:
File Naming Conventions
A file naming convention is a standard framework for naming your files in a way that describes them accurately and consistently. Establishing a file naming convention before starting a project can improve organization and accessibility for your research team.
Image created using Canva.com
Date Formatting
By choosing a standard format for dates, you can avoid confusion and error when naming files. The ISO 8601 date format is an international standard for representing dates and times. It orders the elements as year, month, and day, which keeps file names unambiguous and makes them sort chronologically:
Example: YYYY-MM-DD = 2021-05-02 or 20210502
Follow this link for more information about ISO date formatting: https://www.iso.org/iso-8601-date-and-time-format.html
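As a minimal sketch of the date convention above, the following Python snippet builds a file name with an ISO 8601 date prefix. The function name `dated_filename` and the sample description are illustrative assumptions, not part of any standard tool.

```python
from datetime import date


def dated_filename(description: str, extension: str, day: date) -> str:
    """Prefix a file name with an ISO 8601 basic-format date (YYYYMMDD).

    `dated_filename` is a hypothetical helper written for this example.
    """
    return f"{day.strftime('%Y%m%d')}_{description}.{extension}"


example_day = date(2021, 5, 2)

# Extended format with hyphens
print(example_day.isoformat())  # 2021-05-02

# Basic format without hyphens, used as a file name prefix
print(dated_filename("survey_results", "csv", example_day))  # 20210502_survey_results.csv
```

Either form is unambiguous; what matters is that the whole team uses the same one.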
Standard Characters
Only standard, alphanumeric characters should be used in file names. It’s good practice to:
- Avoid special characters such as !, #, &, and *. These can affect how a file is displayed or handled when it is moved between operating systems.
- Avoid starting or ending your filename with a non-alphanumeric character such as a hyphen or period.
Example: 20180502_survey_results.csv, rather than 201805.survey.results.csv
- Use underscores or capital letters to separate words in your file name:
CamelCase | Pot_hole_case |
ShovelTestSample002.csv | shovel_test_sample_002.csv |
20240715_TissueScanSample005.tiff | 20240715_tissue_scan_sample_005.tiff |
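The character guidelines in this section could be applied automatically with a small script. The sketch below is one possible approach, assuming that letters, digits, underscores, and hyphens are acceptable and that anything else should become an underscore; the function name `clean_filename` is hypothetical.

```python
import re


def clean_filename(name: str) -> str:
    """Return a file name limited to letters, digits, underscores, and hyphens.

    Hypothetical helper: the allowed character set is an assumption
    chosen for this example.
    """
    stem, dot, extension = name.rpartition(".")
    if not dot:  # no period found, so there is no extension
        stem, extension = name, ""
    # Replace each run of disallowed characters (spaces, periods, !, #, ...) with one underscore
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem)
    # Do not start or end the name with a non-alphanumeric character
    stem = stem.strip("_-")
    return f"{stem}.{extension}" if extension else stem


print(clean_filename("201805.survey.results.csv"))  # 201805_survey_results.csv
print(clean_filename("-Tissue Scan #5!.tiff"))      # Tissue_Scan_5.tiff
```

Review the suggested names before renaming anything, since two different originals can map to the same cleaned name.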
Sequential Ordering
When using numbers in a file name to designate an order, use leading zeros for consistency and better readability. Labeling a file with 01 will order files up to 99, and 001 will order files up to 999.
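The effect of leading zeros is easy to see when file names are sorted as text, which is how most file browsers order them. This short Python sketch compares unpadded and padded names; `numbered_name` is a hypothetical helper and the `sample` prefix is only an example.

```python
def numbered_name(prefix: str, index: int, width: int = 3, extension: str = "csv") -> str:
    """Build a file name whose sequence number is zero-padded to `width` digits."""
    return f"{prefix}_{index:0{width}d}.{extension}"


unpadded = sorted(f"sample_{i}.csv" for i in (1, 2, 10))
padded = sorted(numbered_name("sample", i) for i in (1, 2, 10))

print(unpadded)  # ['sample_1.csv', 'sample_10.csv', 'sample_2.csv']
print(padded)    # ['sample_001.csv', 'sample_002.csv', 'sample_010.csv']
```

Without padding, file 10 sorts ahead of file 2; with three digits, the order holds up to 999.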
File Directory Structure Conventions
Organizing your data folders into a clear directory structure makes files and versions easier to locate. Evaluate the best hierarchy for organizing your files and determine whether a deep or shallow hierarchy better suits your needs. If your team has multiple independent data collections, create a distinct folder for each one.
Directory top-level folders should include the project title, a unique identifier, and the date (year). The substructure should have a clear, consistent naming convention, e.g., uniform conventions for labeling each run of an experiment, each version of a dataset, and/or each person in the group.
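A directory layout like the one described above can be created with a few lines of Python. This is a sketch only: the project title, identifier, and subfolder names (`raw_data`, `processed_data`, `documentation`) are placeholder assumptions, and the example writes to a temporary directory so it does not touch real project files.

```python
import tempfile
from pathlib import Path


def create_project_tree(root: Path, title: str, identifier: str, year: int, subfolders) -> Path:
    """Create a top-level folder named title_identifier_year with the given subfolders.

    Hypothetical helper; the naming pattern is one possible convention.
    """
    top = root / f"{title}_{identifier}_{year}"
    for sub in subfolders:
        # parents=True also creates the top-level folder; exist_ok=True makes reruns safe
        (top / sub).mkdir(parents=True, exist_ok=True)
    return top


with tempfile.TemporaryDirectory() as tmp:
    top = create_project_tree(
        Path(tmp), "CoastalSurvey", "CS01", 2024,
        ["raw_data", "processed_data", "documentation"],
    )
    print(top.name)                                # CoastalSurvey_CS01_2024
    print(sorted(p.name for p in top.iterdir()))   # ['documentation', 'processed_data', 'raw_data']
```

Generating the folders from a script helps every project in a group start from the same structure.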
File Formats to Avoid Obsolescence
As technology changes, so too do the ways researchers can access and utilize data. This includes ever-changing file formats for proprietary software. To increase the longevity of your data, use file formats that are likely to remain accessible for the foreseeable future.
Obsolescence-resistant file formats are typically:
- Non-proprietary
- Open, documented standards
- Commonly used by the research community
- Standard representations (e.g., ASCII or Unicode)
- Unencrypted
- Uncompressed
Examples of these formats are:
- PDF or RTF (not Word)
- ASCII or CSV (not Excel)
- MPEG-4 (not QuickTime)
- TIFF or JPEG2000 (not GIF or JPG)
- XML or RDF (not RDBMS)
Image by Esteban.alej from Wikimedia Commons
For a more exhaustive list of recommended file formats to avoid obsolescence, visit -
More Examples of Stable File Formats
Type of Data | Stable File Format Examples |
Text | ASCII, XML, PDF/A, HTML, UTF-8 |
Tabular Data | CSV |
Still Images | TIFF, JPEG, PDF, PNG, GIF, BMP |
Geospatial | SHP, DBF, GeoTIFF, NetCDF |
Databases | XML, CSV |
Data Versioning
It can be both useful and necessary to retain different versions of datasets as they are transformed. For example, data may need re-processing to include new calculations; errors may need to be corrected; or new data might need to be generated and added to the dataset.
Editing the original file risks irreversible loss of the raw data if an error is made while overwriting it. Instead, researchers can save a new version of the dataset each time changes are made.
To manage and keep track of older data versions, it’s recommended to add a number to the file name for each version.
For example, v2, v3, etc.:
- 20240325_MultiAnalysis_Submission_v2.csv
- HarrisStudy_Survey002_20220603_v3.1.csv
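Version suffixes like these can be generated rather than typed by hand. The sketch below assumes a `_v<number>` tag placed directly before the file extension, optionally followed by a minor number such as `.1`, and treats an untagged file as the original so that its first revision becomes `v2`. The function name `next_major_version` is hypothetical.

```python
import re

# Matches a trailing version tag such as _v2.csv or _v3.1.csv (assumed convention)
VERSION_TAG = re.compile(r"_v(\d+)(\.\d+)?\.(\w+)$")


def next_major_version(filename: str) -> str:
    """Return the file name for the next major version.

    Assumes the file name has an extension. Any minor number is dropped
    when the major number increases.
    """
    match = VERSION_TAG.search(filename)
    if match:
        major = int(match.group(1)) + 1
        return f"{filename[:match.start()]}_v{major}.{match.group(3)}"
    # No version tag yet: treat the file as the original and start at v2
    stem, _, extension = filename.rpartition(".")
    return f"{stem}_v2.{extension}"


print(next_major_version("20240325_MultiAnalysis_Submission_v2.csv"))
# 20240325_MultiAnalysis_Submission_v3.csv
print(next_major_version("HarrisStudy_Survey002_20220603_v3.1.csv"))
# HarrisStudy_Survey002_20220603_v4.csv
```

Saving changes under the new name, rather than overwriting, leaves every earlier version intact.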
Data Versioning Control Tables
Creating a control table can keep track of different versions of your files and help you and your team document changes. A control table describes which versions of the document were created, what the change was, who made it, and when. Consider creating a README file or spreadsheet listing the data versions to include with your dataset prior to storage. See below for an example:
Version | Author | Change or Purpose | Date |
1.1 | TCN | Corrected formula in column 7 | 20240427 |
1.2 | RD | Amended references 3 and 4 | 20240430 |
1.3 | JRW | Formatted results table | 20240515 |
2.0 | MJM | Added statistical analysis section | 20240529 |
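A control table like the one above can also be kept as a CSV file so that it stays in a stable, non-proprietary format. The following Python sketch renders the example rows as CSV text using only the standard library; the function name `control_table_csv` is an assumption made for this example.

```python
import csv
import io

FIELDS = ["Version", "Author", "Change or Purpose", "Date"]

# Rows copied from the example control table
ENTRIES = [
    {"Version": "1.1", "Author": "TCN", "Change or Purpose": "Corrected formula in column 7", "Date": "20240427"},
    {"Version": "1.2", "Author": "RD", "Change or Purpose": "Amended references 3 and 4", "Date": "20240430"},
    {"Version": "1.3", "Author": "JRW", "Change or Purpose": "Formatted results table", "Date": "20240515"},
    {"Version": "2.0", "Author": "MJM", "Change or Purpose": "Added statistical analysis section", "Date": "20240529"},
]


def control_table_csv(entries) -> str:
    """Render control-table entries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


print(control_table_csv(ENTRIES))
```

The resulting text can be written to a file stored alongside the dataset, with one new row added for each version.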