Fetching plant occurrence records from GBIF

This notebook contains code used to pull plant species occurrence records from the GBIF API.

We use pacman to manage the R packages and load libraries.

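## First check for the required packages, install if needed, and load the libraries.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# remotes is needed for install_github() below
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")

BiocManager::install("sangerseqR")
remotes::install_github("ropensci/bold")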

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, curl, zip, readr, rgbif, usethis, stringr)

Read in BOLD species list and obtain GBIF keys

This block uses a site-specific list of species from the Yellowstone BOLD project to pull any taxon keys for those species hosted on GBIF, matched by exact scientific names.

The list of species we used can be accessed from Anderson and Hoff (2024), and filtered for trnL.

Similarly, any data from a BOLD project could be downloaded and used in this analysis to generate a similar map of global coverage from a localized sampling effort.

The data being read in for this block is data set S4 in the Supplement provided with the publication.

species_list <- readr::read_csv("../data/Kartzinel_et_al_Dataset_S4_20241030.csv") %>%
  pull("Species")
Rows: 570 Columns: 62
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (36): Project Code, Process ID, Sample ID, Field ID, rbcL Seq. Length, r...
dbl  (7): rbcL Trace Count, matK Trace Count, trnL-F Trace Count, trnH-psbA ...
lgl (19): BIN, Catalog Num, Image Count, Contamination, Stop Codon, Flagged ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Get all backbone results (without filtering)
all_matches <- name_backbone_checklist(species_list, kingdom = "plants")

exact_key_matches <- all_matches %>%
  filter(matchType == "EXACT") %>%
  select(usageKey) %>%
  as.list()

# Find taxa that didn't match at species level.
not_exact_matches <- all_matches %>%
  filter(matchType != "EXACT")

Investigate taxon keys for fuzzy matches and higher rank matches

Some of the returned keys are shared by more than one local species, and others were backed off to a higher taxonomic rank (genus or family). Backing off can happen for hyper-local species that have no occurrence records in GBIF. Beware that a key matched at a higher rank can pull in many more occurrence records than the species itself would. For our purposes, we kept only exact species matches.

# View results
not_exact_matches
# A tibble: 53 × 26
   usageKey scientificName       canonicalName rank  status confidence matchType
      <int> <chr>                <chr>         <chr> <chr>       <int> <chr>    
 1  3033620 Pulsatilla Mill.     Pulsatilla    GENUS ACCEP…         99 HIGHERRA…
 2  2704858 Calamagrostis Adans. Calamagrostis GENUS ACCEP…         96 HIGHERRA…
 3     3064 Amaranthaceae        Amaranthaceae FAMI… ACCEP…         99 HIGHERRA…
 4  3171742 Erythranthe Spach    Erythranthe   GENUS ACCEP…         99 HIGHERRA…
 5  8148051 Bistorta (L.) Scop.  Bistorta      GENUS ACCEP…         99 HIGHERRA…
 6  8148051 Bistorta (L.) Scop.  Bistorta      GENUS ACCEP…         99 HIGHERRA…
 7  8148051 Bistorta (L.) Scop.  Bistorta      GENUS ACCEP…         99 HIGHERRA…
 8       NA <NA>                 <NA>          <NA>  <NA>          100 NONE     
 9       NA <NA>                 <NA>          <NA>  <NA>          100 NONE     
10       NA <NA>                 <NA>          <NA>  <NA>          100 NONE     
# ℹ 43 more rows
# ℹ 19 more variables: kingdom <chr>, phylum <chr>, order <chr>, family <chr>,
#   genus <chr>, species <chr>, kingdomKey <int>, phylumKey <int>,
#   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
#   speciesKey <int>, synonym <lgl>, class <chr>, acceptedUsageKey <int>,
#   verbatim_name <chr>, verbatim_index <dbl>, verbatim_kingdom <chr>
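
To see how the non-exact matches break down, one option is to tabulate them by match type and returned rank, and to flag any taxon key that was returned for more than one input name. The sketch below is a minimal base R illustration on a small invented data frame (`toy_matches`) that borrows the column names of the `name_backbone_checklist()` output; the rows and keys are hypothetical and are not taken from our results.

```r
# Toy stand-in for `all_matches`; the names and keys are invented for illustration.
toy_matches <- data.frame(
  verbatim_name = c("Species a", "Species b", "Species c", "Species d", "Species e"),
  usageKey      = c(101L, 202L, 202L, NA, 303L),
  rank          = c("SPECIES", "GENUS", "GENUS", NA, "SPECIES"),
  matchType     = c("EXACT", "HIGHERRANK", "HIGHERRANK", "NONE", "FUZZY"),
  stringsAsFactors = FALSE
)

# Count input names by match type and returned rank (rows with an NA rank are kept).
match_summary <- table(toy_matches$matchType, toy_matches$rank, useNA = "ifany")

# Keys returned for more than one input name, i.e. names sharing a taxon key.
key_counts  <- table(toy_matches$usageKey)
shared_keys <- as.integer(names(key_counts)[key_counts > 1])
shared_rows <- toy_matches[toy_matches$usageKey %in% shared_keys, ]

print(match_summary)
print(shared_rows)
```

Applied to the real table, the same two steps would show at a glance which input names collapsed onto a single genus or family key.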

We had 6 species that did not match a species key, resulting in 98% of the species having data we can use from GBIF to explore global geographic coverage of these species.
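
A percentage like the one above can be computed directly from the `matchType` column. The following is a small sketch on a hypothetical vector of match types (`toy_match_type`); the counts are made up to show the arithmetic and do not reproduce our numbers.

```r
# Hypothetical match types for ten input names (not our data).
toy_match_type <- c(rep("EXACT", 8), "HIGHERRANK", "NONE")

n_total   <- length(toy_match_type)
n_exact   <- sum(toy_match_type == "EXACT")
pct_exact <- round(100 * n_exact / n_total, 1)

message(n_exact, " of ", n_total, " names matched exactly (", pct_exact, "%)")
```

Note that the result depends on whether the input list holds one entry per species or one per specimen, so it may be worth de-duplicating the names first.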

Set GBIF credentials

The following block will open your .Renviron file. Register an account with GBIF on their website, then add these environment variables to .Renviron, one per line, and save:

GBIF_USER="user"
GBIF_PWD="password"
GBIF_EMAIL="email"

After requesting the data based on our list of taxon keys, we will get millions of occurrence records that we can download; the data will be held in your GBIF portal.

usethis::edit_r_environ()
☐ Edit '/Users/tdivoll/.Renviron'.
☐ Restart R for changes to take effect.
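
After restarting R, it can be worth confirming that the three variables are visible to the session before sending a request. The helper below, `missing_gbif_credentials()`, is a hypothetical convenience function written for this note, not part of rgbif; it only checks that each variable is set and non-empty, not that the values are valid.

```r
# Names of the environment variables set in .Renviron above.
gbif_vars <- c("GBIF_USER", "GBIF_PWD", "GBIF_EMAIL")

# Hypothetical helper: return the names of any variables that are unset or empty.
missing_gbif_credentials <- function(vars = gbif_vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

missing <- missing_gbif_credentials()
if (length(missing) > 0) {
  message("Not set: ", paste(missing, collapse = ", "))
} else {
  message("All GBIF credentials found.")
}
```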

Request the occurrence data

We’ll further restrict the data returned to records that have reliable coordinate data, and use a simple CSV format to reduce the size of the data. The Darwin Core Archive format will include much more metadata, but we’re only interested in the locations for this analysis.

gbif_data_BOLDlist <- occ_download(
  pred_in("taxonKey", exact_key_matches$usageKey),
  pred("hasCoordinate", TRUE),
  pred("hasGeospatialIssue", FALSE),
  format = "SIMPLE_CSV"
)

Get metadata and wait

Get the metadata about the request.

gbif_data_BOLDlist # this will print some info, including the download ID we need to check on the job
<<gbif download>>
  Your download is being processed by GBIF:
  https://www.gbif.org/occurrence/download/0007289-250127130748423
  Most downloads finish within 15 min.
  Check status with
  occ_download_wait('0007289-250127130748423')
  After it finishes, use
  d <- occ_download_get('0007289-250127130748423') %>%
    occ_download_import()
  to retrieve your download.
Download Info:
  Username: tdivoll
  E-mail: timothy_divoll@brown.edu
  Format: SIMPLE_CSV
  Download key: 0007289-250127130748423
  Created: 2025-02-01T12:16:59.420+00:00
Citation Info:  
  Please always cite the download DOI when using this data.
  https://www.gbif.org/citation-guidelines
  DOI: 10.15468/dl.7bdmx3
  Citation:
  GBIF Occurrence Download https://doi.org/10.15468/dl.7bdmx3 Accessed from R via rgbif (https://github.com/ropensci/rgbif) on 2025-02-01

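Check the status of the download. Note that the key used below comes from an earlier request (created 2025-01-13), not the one printed above; substitute the download key reported for your own request.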

occ_download_wait('0066939-241126133413365')
status: succeeded
download is done, status: succeeded
<<gbif download metadata>>
  Status: SUCCEEDED
  DOI: 10.15468/dl.48qedg
  Format: SIMPLE_CSV
  Download key: 0066939-241126133413365
  Created: 2025-01-13T17:50:02.365+00:00
  Modified: 2025-01-13T18:02:18.702+00:00
  Download link: https://api.gbif.org/v1/occurrence/download/request/0066939-241126133413365.zip
  Total records: 20510078
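
Once a download has succeeded, only a few columns are needed for a location-based analysis. Despite its name, the SIMPLE_CSV format is tab-delimited, which matters if the file is read without `occ_download_import()`. The sketch below writes a tiny invented file that mimics that layout and reads back only the location columns with base R; the rows are hypothetical, and a real download has many more columns and far too many rows for `print()`.

```r
# Write a tiny tab-delimited file that mimics the layout of a SIMPLE_CSV download.
# The records are invented for illustration.
toy_file <- tempfile(fileext = ".csv")
writeLines(c(
  "gbifID\tspecies\tdecimalLatitude\tdecimalLongitude\tcountryCode",
  "1\tSpecies a\t44.6\t-110.5\tUS",
  "2\tSpecies b\t46.2\t7.4\tCH"
), toy_file)

# Read the file as tab-delimited and keep only the location columns.
occ <- read.delim(toy_file, quote = "", stringsAsFactors = FALSE)
locations <- occ[, c("species", "decimalLatitude", "decimalLongitude")]

print(locations)
```

Setting `quote = ""` is a defensive choice here, on the assumption that free-text fields in occurrence data may contain stray quote characters.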