Introduction

Data associated with insect specimens is no less important than the specimens themselves and, arguably, is used more often. Curators, researchers, and the public benefit when data is available and accessible.

What data is most important?

When it comes to entomology collections, it is vital to have a unique identifier for each specimen and to know the three "W"s:

  • what (taxonomic name of the specimen)
  • where (locality where it was found and storage location)
  • when (the date of collection)

Who (collector or donor) and how (acquisition details, collection permits, and other associated information) are also very helpful. Images of specimens are part of the data too.

Types of collection data

Data for entomology collections can include different types of information.

Data about the collection

Data about the collection may include acquisition records or details about distinct parts of the collection (for example British flies, Sumatran butterflies or the collection of Baron de Worms).

A screenshot of a database interface
Example of data about the collection: acquisition record on Axiell Collections (type specimens from the Dufresne collection).

Taxonomy data

Taxonomy data includes information such as the number of specimens and their storage locations (as entomology collections are normally arranged taxonomically). This data may be in the form of catalogues, lists of taxa held, or indices.

Example of taxonomy data: a list of taxa held for Diptera: Psychodidae.

Specimen data

Specimen data includes where, when, and under what circumstances a specimen was collected and acquired by the collection. This data is most often on the specimen label but may also be found in old registers or field notebooks.

Example of specimen data: PAPIS specimen database.

Formats of collection data

Data for entomology collections can be found in three main formats:

  • analogue: the physical specimen label or other physical objects like manuscripts, books, and catalogues (you can read more about labelling in Specimen Preparation and Conservation)
  • digital: digital files containing searchable text, such as spreadsheets and databases
  • hybrid: digital files not containing textual information, such as images of labels or registers

Only digital data can be used and shared with ease. That is why we will focus on digital data in the rest of this training.

Data management software

Digital data can be stored and managed in a variety of formats and software, including: 

  • Plain text files
  • Spreadsheets
  • Collection management systems (CMS)

Plain text files

Plain text files are the simplest way to manage digital data. Text files allow you to retrieve information by searching for a string of characters. However, it’s very difficult to maintain data integrity and structure or to analyse data.

Spreadsheets

Spreadsheets can be a very useful tool and are used widely to collect and exchange data. It is possible to structure data, and search, filter and analyse records. However, data presentation and integrity capabilities are limited.

Collection management systems

Collection management systems (CMS) are mostly built as relational databases and can perform most, if not all, functions necessary in collection management. There are a great number of CMSs suitable for natural history collections. They can be:

Whatever data management software is used, it is imperative to follow accepted data standards and to structure data so that it can be easily queried, manipulated, and shared. Biodiversity Information Standards (TDWG) has developed Darwin Core and the World Geographical Scheme to help with this.
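As a rough illustration of structuring data against a standard, the sketch below writes one specimen record to CSV text using Darwin Core term names as column headers. The choice of columns is an assumption made for this example, not a required set, and the record itself is invented.

```python
import csv
import io

# Column headers are Darwin Core term names; which terms to include is an
# assumption made for this example.
DWC_COLUMNS = [
    "occurrenceID",
    "scientificName",
    "country",
    "locality",
    "eventDate",
    "recordedBy",
]


def records_to_csv(records):
    """Return the records as CSV text with Darwin Core column headers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical record, invented for illustration only.
example_records = [
    {
        "occurrenceID": "NMS-000000001",
        "scientificName": "Psychoda alternata",
        "country": "United Kingdom",
        "locality": "Edinburgh",
        "eventDate": "2019-06-14",
        "recordedBy": "A. Collector",
    }
]

csv_text = records_to_csv(example_records)
```

Using shared column names like these means the same file can be read by other institutions or aggregators without first agreeing what each column means.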

Digitisation

Digitisation means translating data from analogue to digital format. It is one of the most important tasks in collection management. It can be an expensive and laborious project but saves time and effort in the long run.

Digitisation can be done to different levels, but once a level is chosen it is crucial to be consistent and maintain this level of digitisation through the entire project.

At the minimum level, digitisation must include the following information for each specimen:

  • unique ID
  • taxonomy
  • location in the collection
  • a transcription of the physical label

Enhanced levels of digitisation may include:

  • verbatim and interpreted locality
  • georeferencing information
  • associated data
  • images
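The two levels above could be checked automatically along the following lines. This is a minimal sketch: the field names are hypothetical, and treating a record as "enhanced" as soon as it has any one enhanced field is an assumption that a real project would want to define more carefully.

```python
# Field names are assumptions chosen for this example, not a standard.
MINIMUM_FIELDS = ("uid", "taxon", "collection_location", "label_transcription")
ENHANCED_FIELDS = (
    "verbatim_locality",
    "interpreted_locality",
    "coordinates",
    "associated_data",
    "images",
)


def missing_fields(record, required=MINIMUM_FIELDS):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in required if not record.get(name)]


def digitisation_level(record):
    """Classify a record as 'incomplete', 'minimum' or 'enhanced'."""
    if missing_fields(record):
        return "incomplete"
    if any(record.get(name) for name in ENHANCED_FIELDS):
        return "enhanced"
    return "minimum"
```

Running a check like this over the whole dataset is one way to confirm that the chosen level has been applied consistently.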

Unique ID

A unique identifier (uID) is a combination of letters and numbers that identifies one specimen in the collection and no other. Every specimen must be assigned its own uID. The uID is assigned during the registration process, either by software or manually.

In many cases, particularly when acquiring a large insect collection, curators and registrars will register the entire collection under one registration number (e.g. NMSZ.2019.29). They then append a number for every specimen, to make the string unique for each specimen (e.g. NMSZ.2019.29.17).

It can be difficult to avoid duplicate numbers when assigning uIDs manually. If possible, use continuous numbering with a fixed number of digits, padded with leading zeros, to ensure each number is used only once (NMS-000000001 to NMS-999999999). Because every uID then has the same length, spreadsheets and databases sort them in the correct order even when they are treated as text.
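The numbering scheme described above could be generated as follows. This is only a sketch: the function names are hypothetical, and the "NMS" prefix and nine-digit width are taken from the example range in the text.

```python
def format_uid(number, prefix="NMS", width=9):
    """Format a sequence number as a fixed-width, zero-padded uID."""
    if not 1 <= number < 10 ** width:
        raise ValueError(f"number must be between 1 and {10 ** width - 1}")
    return f"{prefix}-{number:0{width}d}"


def next_uid(existing_uids, prefix="NMS", width=9):
    """Return the uID that follows the highest number already in use."""
    highest = max(
        (int(uid.rsplit("-", 1)[1]) for uid in existing_uids),
        default=0,
    )
    return format_uid(highest + 1, prefix, width)
```

The zero padding is what makes sorting work. Sorted as text, "NMS-10" comes before "NMS-9" because the character "1" sorts before "9", whereas "NMS-000000009" correctly comes before "NMS-000000010".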

A label with the uID should be printed and added to the corresponding specimen (as pictured).

Pinned insect specimens with barcodes attached
Example of barcode labels. Credit: Neil Hanna

A uID can also be encoded as a barcode and printed as a barcode label. Two-dimensional DataMatrix and QR barcodes are the most commonly used formats in entomological collections. It is best practice to print the alphanumeric version of the uID next to the barcode. This way the label will be both human- and machine-readable. See the iDigBio Specimen Barcode and Labeling Guide for more information.

Universally Unique Identifiers (UUIDs, also known as GUIDs) are the safest way to ensure that uIDs are unique. They are widely used in software, including CMS. However, because of the length of the string (32 hexadecimal digits, usually written as 36 characters with hyphens), they’re not very user-friendly and not always practicable for labels.
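For illustration, Python's standard `uuid` module can generate such an identifier. The variable names below are hypothetical; the sketch simply shows how long the resulting string is in its two usual forms.

```python
import uuid

# uuid4() generates a random (version 4) UUID.
specimen_uuid = uuid.uuid4()

canonical = str(specimen_uuid)  # 36 characters, including four hyphens
compact = specimen_uuid.hex     # the same value as 32 hexadecimal digits
```

Either form is far longer than a registration number, which is why a UUID is often stored in the database while a shorter uID is printed on the label.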

Taxonomy

Recording accurate biological taxonomy information can be challenging for entomological collections. There are vast numbers of species, even just in the UK. Curators are not always experts in insects, and never in all insects. Names for species recorded in the collection may be incorrect or outdated.

Fortunately, the UK has one of the best studied faunas in the world, and modern catalogues and checklists have been published for all major groups. Consult these resources to help you accurately record species in your collection:

If a specimen is re-identified, it is important to:

  • preserve the previous identification history
  • record the new taxonomic name
  • record who re-identified the specimen
  • record the date when it was re-identified
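One way to meet all four points is to store identifications as a list that is only ever appended to. The sketch below assumes hypothetical class and field names; a CMS will normally provide its own equivalent.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Identification:
    taxon_name: str
    identified_by: str
    date_identified: date


@dataclass
class Specimen:
    uid: str
    # Oldest identification first; the last entry is the current one.
    identifications: list = field(default_factory=list)

    def reidentify(self, taxon_name, identified_by, date_identified):
        """Append a new identification without overwriting earlier ones."""
        self.identifications.append(
            Identification(taxon_name, identified_by, date_identified)
        )

    @property
    def current_name(self):
        """Return the most recent taxonomic name, or None if unidentified."""
        if not self.identifications:
            return None
        return self.identifications[-1].taxon_name
```

Because earlier entries are never edited or deleted, the full identification history stays available alongside the current name.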

Location

Record the location of the specimen in the collection and update the location any time the specimen is moved.

Location can be recorded as absolute (room, aisle, cabinet, drawer/box) or relative (taxonomic series). In the latter case a distinct taxonomic unit, for example a family, is assigned a number or a string. If another drawer or cabinet is added to a series, the other series are not affected and do not have to be updated.
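The advantage of relative locations can be sketched with two lookup tables. The series codes, drawer names, and function names below are all invented for illustration.

```python
# Hypothetical series codes and drawer names, invented for illustration.
series_drawers = {
    "DIP-PSY": ["Cabinet 12, drawer 1", "Cabinet 12, drawer 2"],
    "DIP-SYR": ["Cabinet 13, drawer 1"],
}

# Specimen records store only the series code (the relative location).
specimen_series = {
    "NMS-000000001": "DIP-PSY",
    "NMS-000000002": "DIP-SYR",
}


def drawers_for(uid):
    """Look up the drawers that hold the series a specimen belongs to."""
    return series_drawers[specimen_series[uid]]


def add_drawer(series_code, drawer):
    """Extend one series; no specimen record or other series changes."""
    series_drawers[series_code].append(drawer)
```

Adding a drawer to one series changes a single entry in `series_drawers`, while every specimen record stays as it was.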

Label transcription

It is important to transcribe the physical data labels of each specimen when digitising a collection. To do this, you usually need to remove the labels from the specimen pin. You can then photograph the labels and transcribe from these images.

Transcribing data labels can be difficult. These are some of the problems you might come across.

Incomplete data

Sometimes labels have incomplete data, particularly in terms of locality (where the insect was collected). 

The locality of the specimen should be recorded as completely as possible, in increasing order of precision (for example, country, state or county, major area, minor area, exact locality, coordinates). However, often this was not done with older specimens. 

In these cases, when a locality cannot be identified with certainty, you will need to do some detective work. For example, you may have to use old maps, gazetteers, or even newspapers to find out how place names have changed over time.

Overall, when digitising labels it is best practice to record locality information in two ways: verbatim (as it occurs on the label or in the ledger or field notebook) and interpreted (complete locality information separated into fields such as country, state or county, and so on).
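A record that keeps both forms side by side might look like the sketch below. The field names follow Darwin Core terms, but the selection of fields is an assumption, and the label text and its interpretation are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Locality:
    # Field names follow Darwin Core terms; the selection is an assumption.
    verbatimLocality: str
    country: Optional[str] = None
    stateProvince: Optional[str] = None
    county: Optional[str] = None
    locality: Optional[str] = None

    def interpreted(self):
        """Join the interpreted fields, from broadest to most precise."""
        parts = [self.country, self.stateProvince, self.county, self.locality]
        return ", ".join(p for p in parts if p)


# Hypothetical label text and interpretation, invented for illustration.
example = Locality(
    verbatimLocality="Edinb., Blackford Hill",
    country="United Kingdom",
    stateProvince="Scotland",
    county="City of Edinburgh",
    locality="Blackford Hill",
)
```

Keeping the verbatim text untouched means a later reader can always check the interpretation against what was actually written on the label.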

Example of printed collecting labels.

Disassociated data

Collectors in the past often recorded complete collecting data in catalogues or ledgers and put only a number or placeholder on the specimen label. In these cases, the data is at risk of becoming disassociated from the specimen.

When digitising, it is best practice to transcribe the ledgers and match the specimens with the corresponding records. This may take more time but makes the data much easier to use and share in the long run.
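Once the ledger has been transcribed, the matching step could be sketched as below. The function name and the shape of the two inputs are assumptions made for this example.

```python
def match_ledger(label_numbers, ledger):
    """Pair each specimen's label number with its ledger entry.

    label_numbers maps uID -> the number written on the specimen label.
    ledger maps ledger number -> transcribed collecting data.
    Returns (matched, unmatched): matched maps uID -> ledger entry,
    unmatched lists the uIDs whose number has no ledger entry.
    """
    matched, unmatched = {}, []
    for uid, number in label_numbers.items():
        entry = ledger.get(number)
        if entry is None:
            unmatched.append(uid)
        else:
            matched[uid] = entry
    return matched, unmatched
```

Returning the unmatched specimens separately gives a list to follow up by hand, rather than silently dropping them.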

Temporary labels

During collecting, specimens are often given a temporary field label. These labels usually have incomplete or abbreviated information, with more accurate data often recorded in the field notebooks.

Temporary labels should be replaced with proper labels immediately after collecting, but this does not always happen. This is a problem because if we lose knowledge about where and when the insect was collected, the scientific value of the specimen is very limited.

Avoid accepting donations of specimen collections that are not properly labelled, or ensure temporary labels are replaced with permanent labels before formal acquisition.

Georeferencing

Georeferencing involves calculating geographic coordinates for a locality, so we can accurately find on a map where the insect was collected.

During digitisation it is best practice, where possible, to calculate and add coordinates for specimens that do not have this data. For more detailed guidance on georeferencing, refer to the Global Biodiversity Information Facility (GBIF):

Ordnance Survey (OS) grid references are most commonly used in the UK, but they have to be converted to geographic coordinates (latitude and longitude) if the data is to be published with aggregators such as GBIF. The most common and useful coordinate format is decimal degrees.
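The first step of that conversion, turning a grid reference into a numeric easting and northing in metres, can be sketched as below. The function name is hypothetical, and the sketch does not check that the square lies within Great Britain. The second step, converting British National Grid eastings and northings (EPSG:27700) to WGS 84 latitude and longitude (EPSG:4326), involves a projection and datum transformation that is best left to GIS software or a projection library, so it is not shown here.

```python
# The 25 letters used for grid squares; "I" is not used.
GRID_LETTERS = "ABCDEFGHJKLMNOPQRSTUVWXYZ"


def grid_ref_to_easting_northing(grid_ref):
    """Convert an OS grid reference such as 'NT 258 733' to metres.

    Returns the easting and northing of the south-west corner of the
    grid square that the reference identifies.
    """
    ref = grid_ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) != 2 or any(c not in GRID_LETTERS for c in letters):
        raise ValueError(f"invalid grid letters in {grid_ref!r}")
    if (digits and not digits.isdigit()) or len(digits) % 2 or len(digits) > 10:
        raise ValueError(f"invalid grid digits in {grid_ref!r}")

    l1 = GRID_LETTERS.index(letters[0])
    l2 = GRID_LETTERS.index(letters[1])
    # Position of the 100 km square, counted from the grid's false origin.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    half = len(digits) // 2
    # Pad each half to 5 digits so that e.g. '258' means 25,800 m.
    easting = int(digits[:half].ljust(5, "0")) if half else 0
    northing = int(digits[half:].ljust(5, "0")) if half else 0
    return e100km * 100_000 + easting, n100km * 100_000 + northing
```

A six-figure reference identifies a 100 m square, so the result is only as precise as the reference itself. It is worth recording that uncertainty alongside the coordinates.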

Born-digital data

When collecting insects nowadays, it is best practice, if possible, to create data in digital form right in the field. We can do this using digital pens and tablets, along with platforms such as Epicollect5. This helps to avoid complications with digitisation and speeds up labelling and databasing of specimens.

Data mobilisation and publication

To be useful, digitised data must be published. This can be done through:

A software interface with a map of the United Kingdom and pins showing the locations of different entomological species.
The interface of the GBIF platform.
The interface of the Natural History Museum's Data Portal.

Additional resources


Header image credit: Duncan McGlynn.