Data Policies and Guidelines

Introduction

The purpose of this document is to assess and evaluate data management recommendations and procedures for the R.J. Cook Agronomy Farm (CAF) Long-Term Agroecosystem Research (LTAR) site.  Multiple agencies, including universities, governmental organizations, and corporations, conduct research at CAF and collect a diversity of data (see Data Inventory).  Collected data need to be easily shared, archived, and discoverable by future researchers.  Without a coherent, agreed-upon data management plan, data are often lost or poorly documented, preventing easy reuse or reproducibility.

This will be used as a technical document to consolidate data and to standardize the collection process, improving data integrity, archiving, and sharing.  Because there are ongoing experiments, existing collection procedures, and various historic and unorganized datasets, this plan is intended to act as an intermediary “compromise” step toward a final data solution.

The Data Management Plan is a similar document but is less technical and less detailed.

Data Infrastructure

Data infrastructure consists of multiple cloud service providers and local workstations.

  • Google Drive via the cafltar.org G Suites account is used for backup and sharing of data associated with unorganized, ongoing, and finalized projects.
    • Personal documents should be in individual user’s “My Drive” folder and collaborative data and documents should be in a “Team Drive” folder.
  • The CAHNRS cloud share, \\cloud.cahnrs.wsu.edu\Cahnrs.Css.HugginsLab, is used for archived data.
  • Git and Github are used for code sharing and version control of all scripts and programs associated with data processing and derived applications such as decision support services.  Associated repositories should be created under the Organization “caf-ltar” (www.github.com/cafltar)
  • Basecamp via the ARS-LTAR account is used for project management.
  • ArcGIS Online is used to house “published” versions of GIS-related datasets under the USDA ARS account (https://usdaars.maps.arcgis.com)
  • Certain datasets will fall within the LTAR’s common experiment program and will be housed within the National Agricultural Library’s (NAL) Common Observational Repository (CORe), now referred to as the “LTAR Data Portal”.
  • The NAL LTAR data portal, GeoPortal, is used to define all metadata (but is not used to host actual datasets).  Once data are hosted on a long-term repository, the metadata should be submitted to GeoPortal.
  • Local workstations will house working data but will be backed up on a regular schedule through a series of external hard drives and cloud solutions (Google Drive, Dropbox, OneDrive).
  • Microsoft Azure via the LTAR-Cook-Agronomy-Farm subscription is used to manage various data:
    • Azure Blob Storage is used to house “published” datasets that are not published to ArcGIS Online or through the LTAR Data Portal.
      • Data are stored in the “cafpublic” blob storage account, in the “data” container, which has a CNAME record pointing to it so the public can access files via “files.cafltar.org” (see the example after this list).
    • Azure Blob Storage is also used for intermediate steps in various data pipelines.
    • Azure Functions and Azure Logic Apps are used for data transformation, alerts, quality control, and other data processing steps.  See the individual data pipelines for details.
    • Azure Cosmos DB acts as a queryable pseudo data lake for CAF data.  See the CosmosDatabase document for more details on organization and schema.
    • Azure Virtual Machines are used to run Campbell Scientific LoggerNet Admin.
  • Resource network: https://onodo.org/visualizations/19755
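
As an illustration of public access, the sketch below retrieves a published file through the “files.cafltar.org” CNAME described above.  This is a minimal sketch: the file path is hypothetical, and it assumes the CNAME serves HTTPS.

  import urllib.request

  BASE_URL = "https://files.cafltar.org"

  def download_published_file(blob_path: str, dest: str) -> None:
      """Download a file from the public "cafpublic" blob container."""
      urllib.request.urlretrieve(f"{BASE_URL}/{blob_path}", dest)

  # Hypothetical path; substitute the path of an actual published file.
  # download_published_file("example/CafSoilData.csv", "CafSoilData.csv")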



A basic data flow diagram reflects the quality and shareability of data (Figure 1).


Figure 1: Data quality follows a general path related to the infrastructure housing the data.  “My Drive” and “Team Drive” refer to the root directories of users’ Google Drive accounts.

Organizing and managing data

Introduction

An agreed-upon scheme for organizing and naming files aids data sharing and project collaboration.  Standards also allow server-run scripts to automate numerous data processing steps.

The structure takes advantage of Google Drive’s “My Drive” and “Team Drive” capabilities.

  • Team Drive > {ProjectName}: Contains data in a shareable state, with all associated metadata present and internal standards followed.
  • My Drive or personal storage: Contains data not in a shareable state - mostly for backing up ongoing projects that belong to a single researcher.  Once projects are finalized, the data move to Archives.  If more than one person contributes to the project, the data are moved to a Team Drive folder (with associated metadata).
  • Archives: Contains data that are not organized or structured - mostly for backing up historic data, user data not organized, etc.
  • Team Drive > Documents: Contains template documents, controlled vocabularies, inventories of metadata, etc.

Datasets, experiments, and other projects are identified by a unique name (further information on naming below).  This name is used as the name of the folder under the “Projects” directory, as the Team Drive name, as the base of the associated Github repository, and for the associated Trello (or Basecamp) project.

Naming projects

It is advised that project names include the area of interest, category of data, and a descriptive title, in either PascalCase or using hyphens and underscores.  For example, files relating to the planning of eddy covariance flux towers across the CAF LTAR region will have the title CafInfrastructureECTowerSetup (or caf_infrastructure_ec-tower-setup), where:

  • Area of interest = Caf
  • Data category = Infrastructure
  • Descriptive title = ECTowerSetup

Definitions of areas of interest and data categories can be found in the “Documents” folder of the shared drive (ControlledVocabularies).
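
The sketch below shows one way a project name could be composed and checked programmatically.  The vocabulary sets here are placeholders, not the authoritative lists in ControlledVocabularies.

  # Placeholder vocabularies; the authoritative lists live in the shared
  # drive's "Documents" folder (ControlledVocabularies).
  AREAS_OF_INTEREST = {"Caf"}
  DATA_CATEGORIES = {"Infrastructure"}

  def project_name(area: str, category: str, title: str) -> str:
      """Build a PascalCase project name, e.g. CafInfrastructureECTowerSetup."""
      if area not in AREAS_OF_INTEREST:
          raise ValueError(f"Unknown area of interest: {area}")
      if category not in DATA_CATEGORIES:
          raise ValueError(f"Unknown data category: {category}")
      return f"{area}{category}{title}"

  print(project_name("Caf", "Infrastructure", "ECTowerSetup"))
  # CafInfrastructureECTowerSetup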

File organization

The organization of the data contained within the “Projects” directory has certain recommendations, but ultimately is left to the managers of the project, provided the data are self-described.  A project folder may have subfolders for:

  • Literature: Important links and pdfs of literature.
  • Working: Data and scripts required to generate the final product/dataset.
    • All data used in Methods should be kept in raw or received format and stored in a "Received" or "Input" folder that is made read-only (see the sketch after this list).
    • It is recommended to create subfolders for each coding language or project file.  For example, an ArcMap project should be in a “GIS” or “ArcMap” folder, while Excel files for the same project should be saved in an “Excel” folder.  Under these folders there should be “Input” and “Output” directories.
    • Ideally, all steps to produce the product from the input data should be self-documented by code (with comments and a manual) so one could regenerate the results if needed.  Those code files should be pushed to a git repo, preferably on Github under the LTAR Organization.
  • Results: Final datasets/products.
  • Publications: Related manuscripts, extension bulletins, etc.
  • Received: Contains all exogenous data used by the project; the idea is to make the project folder self-contained and shareable in isolation.  This folder should be read-only.
  • For some data, it makes sense to separate versions by year - for these data, a subdirectory under the project directory with the indicated year is used.  All data within should be self-contained.  Redundant data are expected.
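
As referenced in the list above, here is a minimal sketch of marking a “Received” or “Input” folder read-only on a POSIX filesystem; the project path is hypothetical.  On Windows shares and Google Drive, permissions are instead managed through the service’s own sharing controls.

  import stat
  from pathlib import Path

  def make_read_only(folder: str) -> None:
      """Remove write permission from a folder and everything in it (POSIX)."""
      for path in [Path(folder), *Path(folder).rglob("*")]:
          mode = path.stat().st_mode
          path.chmod(mode & ~stat.S_IWUSR & ~stat.S_IWGRP & ~stat.S_IWOTH)

  # Hypothetical project path:
  # make_read_only("CafInfrastructureECTowerSetup/Received")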

Any project that uses scripts or other code should have an associated git repository hosted on Github, named {ProjectName}_{Descriptor}.  For example, R code within an R folder of the LtarModelingAgroecosystemClasses project may have a git repository named LtarModelingAgroecosystemClasses_R.

Data flows

The processing of samples and measurements made in the field has a well-defined flow as associated datasets are created.  The source of the data dictates the specific flow, but in general it proceeds as follows, with data being generated and some QC occurring at each step (Figure 2).


Figure 2: A generalization of the data flow that accompanies the processing of field samples and measurements.  Note that QC can occur at all steps, but a final QC check occurs before publishing. The level of the QC corresponds to the quality of the published data.

For example, a simplified flow for a soil sample includes taking a soil core and recording the date, sample ID, etc. (Collection); storing it in a freezer and recording location and date (Storage); drying, sieving, grinding, and analyzing it in the lab, with the preparer and results entered into a data entry file (Processing); combining the data entry file into a master file with an ingest script that applies automatic QC (Ingest); and finally updating the metadata file and performing a final QC check before pushing the published data (Final QC/Publish).
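
A minimal sketch of such an ingest step follows.  The column name and QC bound are hypothetical; real pipelines follow each project’s own schema and QC specification.

  from pathlib import Path

  import pandas as pd

  def ingest(entry_dir: str, master_path: str) -> pd.DataFrame:
      """Combine all data entry CSVs and apply a simple automatic QC check."""
      files = sorted(Path(entry_dir).glob("*_DataEntry_*.csv"))
      if not files:
          raise FileNotFoundError(f"No data entry files found in {entry_dir}")
      master = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
      # Hypothetical QC rule: flag non-positive seed masses for review.
      master["qc_flag"] = master["SeedMass"] <= 0
      master.to_csv(master_path, index=False)
      return master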

The flow between Process and Ingest is automated through standard file naming schemes and standard document formats.  Each data source will have different procedures but generally they will follow these standards:

  • Hand-entered data (in the case of milling, threshing, soil dry weight, etc.) and incoming data (e.g. from a weather station) will have the following file name format when possible: "{Project}_{MeasurementType}_DataEntry_{Initials}_{YYYYMMDD}", e.g. CafStudyNutrientsLimeFertility_SeedMass_DataEntry_BRC_20160813.csv (see the sketch after this list).
    • A new file for each entry ensures old files do not get corrupted or tampered with.
    • If data are submitted via a script, the name and version are entered in the place of {Initials}.
  • Data entry/raw files are stored in the {Project}/Methods/Input/Raw folder ("Input" folder name not final).
  • The data entry files will be aggregated and prepared for baseline automatic QC.
  • Raw data input files will have accompanying metadata describing methods used, etc.
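
As noted in the list above, a sketch of validating this file name convention - the piece that makes the Process-to-Ingest handoff automatable - might look like the following.  The pattern is a direct reading of "{Project}_{MeasurementType}_DataEntry_{Initials}_{YYYYMMDD}" and also allows a script name/version in place of initials.

  import re

  DATA_ENTRY_PATTERN = re.compile(
      r"^(?P<project>[A-Za-z0-9]+)_"
      r"(?P<measurement>[A-Za-z0-9]+)_"
      r"DataEntry_"
      r"(?P<initials>[A-Za-z0-9.\-]+)_"  # initials, or script name/version
      r"(?P<date>\d{8})\.csv$"
  )

  def parse_data_entry_name(filename: str) -> dict:
      """Return the name's components, or raise if the convention is broken."""
      match = DATA_ENTRY_PATTERN.match(filename)
      if match is None:
          raise ValueError(f"Nonstandard data entry file name: {filename}")
      return match.groupdict()

  parts = parse_data_entry_name(
      "CafStudyNutrientsLimeFertility_SeedMass_DataEntry_BRC_20160813.csv")
  # parts["date"] == "20160813"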

Some systems may have limitations preventing filenames from adhering to these guidelines.  In this case, the files will be distinguished from further processed files by residing within a “raw” or “dataentry” directory.

Data entry files should contain metadata that describe the methods of attaining the submitted data.  For established methods, a reference ID to a Standard Operating Procedure may suffice, but this file must be contained within the project directory.
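
A minimal sketch of such a sidecar metadata file is below.  The field names and SOP reference ID are illustrative assumptions; the requirement is only that methods are described and any referenced SOP is kept within the project directory.

  import json

  metadata = {
      "dataEntryFile": "CafStudyNutrientsLimeFertility_SeedMass_DataEntry_BRC_20160813.csv",
      "enteredBy": "BRC",
      "methods": "Seed mass measured after threshing and oven drying",  # free-text description
      "sopReference": "SOP-SeedMass-v1",  # hypothetical ID; the SOP itself lives in the project directory
  }

  with open("CafStudyNutrientsLimeFertility_SeedMass_DataEntry_BRC_20160813.json", "w") as f:
      json.dump(metadata, f, indent=2)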

Data quality codes will be published with the data at each dataset’s publishing frequency; near-real-time data, for example, will be provisional and published hourly.

An annual review of datasets will be conducted to finalize data and update quality codes.

Some data do not follow the above-mentioned flow but instead are incorporated into the data management plan at the end of a project.  Historic data fall into this category, as the data are already collected and finalized; so do data from a PI who is not closely associated with the LTAR but later decides to share his or her data.  For these cases, a data manager will sit down with the manager of the dataset to organize the data and create the necessary documents.

Standards for data

It is highly recommended that data contained on a user’s workstation or in the User or Backup folders hosted on Google Drive follow the guidelines below.  Data hosted within the Projects folder of Google Drive must follow these guidelines.


  • Data format should follow guidelines explained in Tidy Data (Wickham, 2014) when possible, although logical groupings of variables and other exceptions to third normal form are expected for ease of data transfer and sharing.
  • Data will be self-described through a combination of standard metadata files and descriptor files located at the root of each project folder:
    • readme.txt/readme.md: These files should be included in one or more directories to document the origin of files at a minimum.  File description, processing, intent, caveats, etc. should be included as well.
    • manifest.json: Describes important folders and files, their intended use, the general structure of the project folder, and any requirements.  Manifest files can be auto-generated from readme.txt/readme.md files placed within directories (see the sketch after this list).
    • description.txt: Simple description of the project.  These files are used to help collaborators understand intent of project.
      • Recommended information: title, purpose, contributors, current status, starting date, duration/expected ending date, actual completion date
      • Defining the contributors of the project puts directory names into context.  For example, the "Received" folder will contain data created external to the contributors.
    • metadata.json: Describes the final dataset derived from the project in ISO 19115:2014.  There may be multiple datasets, though, so there may not be a one-to-one relationship - perhaps just project-level metadata?  This may need rethinking.
  • A project folder should function in isolation; any required data from outside datasets are to be saved in a “read-only” folder.  The original version and source of the input dataset needs to be documented.
  • Ideally, a dataset should be reproducible from input data to final product using documented methods and/or scripts.
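
As noted in the manifest.json item above, a minimal sketch of auto-generating a manifest from readme files follows.  The manifest schema here is a guess; adapt it to whatever structure manifest.json ultimately standardizes on.

  import json
  from pathlib import Path

  def build_manifest(project_dir: str) -> dict:
      """Collect readme.txt/readme.md contents into a single manifest structure."""
      manifest = {"project": Path(project_dir).name, "folders": []}
      for readme in sorted(Path(project_dir).rglob("readme.*")):
          if readme.suffix not in (".txt", ".md"):
              continue
          manifest["folders"].append({
              "path": str(readme.parent.relative_to(project_dir)),
              "description": readme.read_text(encoding="utf-8").strip(),
          })
      return manifest

  # with open("manifest.json", "w") as f:
  #     json.dump(build_manifest("CafInfrastructureECTowerSetup"), f, indent=2)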


The LTAR network has yet to establish standards for quality assurance and quality control, including standard codes for data quality, reason for changing data, replacement actions, and more.  These standards will be adopted once they are determined.


TODO: Update details regarding adoption of MODIS QC codes, internal specifications, and mapping to the NAL LTAR repository.

Data Health

To ensure data practices outlined in this document are being followed, and to adjust these guidelines if needed, researchers meet with data managers on a regular schedule.  

Policies for access and sharing (from Data Curation - USDA Data Management Plan 11.10.16_final.pdf)

Scientists are able to restrict or grant public access to data in the repository.  The sharing of the data housed in ArcGIS Online or Google Drive will be focused on the goals of end users.  These users include but are not limited to: graduate students, PIs, stakeholders, collaborators, and the general public.

Access to datasets will also be increased through the National Agricultural Library’s Ag Data Commons application, which provides URLs for connecting data to potential end users.  The mapping application in the Ag Data Commons displays data visually as a map or through other functions, such as queries, downloads, and graphs.  A Digital Object Identifier (DOI) will be created for each dataset; the DOI makes each dataset identifiable and gives users an easy way to cite the data.

All datasets will be securely withheld from public access until they are ready to be shared.  Datasets will be screened and will go through an approval process to ensure the data are accurate, cleaned up, and ready for public access.

References

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23.