In the third episode of Genomic Connections, Christian and Kasia chat with Joana Pauperio, Biodiversity Project Manager at the European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EMBL-EBI). They discuss the concept of “metadata,” why it is crucial for high-quality research, and some best practices scientists should follow to ensure the data they produce is FAIR—Findable, Accessible, Interoperable, and Reusable.

You can listen to Genomic Connections on Spotify and PocketCast. The RSS Feed is available here.

The episode’s full Transcript (AI-generated) is available below the credits.


Credits

“Genomic Connections” is a podcast about the science, stories, and people behind biodiversity genomics, produced by ERGA and iBOL Europe within the Biodiversity Genomics Europe (BGE) project.

“Genomic Connections” is written and produced by Christian de Guttry, Kasia Fantoni, Luisa Marins and Chiara Bortoluzzi.

Graphic design by Luisa Marins.

Music (intro and outro): “Nostalgic Reflections” by Ant.Survila, (c) copyright 2025. Licensed under a CC-BY-NC 4.0 license. Ft: airtone.

BGE is a Horizon Europe project funded by the European Commission, the Swiss Confederation and the United Kingdom.

The episode is licensed under a CC-BY 4.0 license.

You can listen to Genomic Connections on Spotify and PocketCast.

Episode #3 Transcript

Hey, Kasia, yesterday I spent an hour at the airport watching identical black suitcases spin around the belt. One traveler had taped a giant pink flamingo to his handle: he arrived, collected his bag immediately, and I kept playing suitcase roulette. Kasia: Ah, I feel for you. But this is actually an excellent metaphor: in genomics, rich sample metadata is that flamingo tag. A DNA sequence carrying specimen information, permit codes and preservation method leaves the belt all at once and is highly traceable. A sequence without those details exists only in scientific limbo, destined for lost property.
Imagine a groundbreaking biodiversity study making the headlines, only for scientists to realize that the DNA records lack specimen information. Without those ties, the dataset fails every FAIR test: future researchers cannot replicate the work or build on it, and months of field effort drift into that same scientific limbo. Indeed, and for these reasons, today we will explore how standardized metadata and FAIR principles turn raw reads into fully ticketed passenger bags that can travel the world of research. Our guest works daily to attach those flamingo tags before the flight takes off. I am Kasia Fantoni, iBOL Europe Community Manager. And I am Christian de Guttry, ERGA Project Manager. You’re listening to Genomic Connections, a podcast about biodiversity genomics. Let’s get started.
For this episode, we are joined by Joana Pauperio. Joana is a Biodiversity Project Manager at the European Nucleotide Archive at the European Bioinformatics Institute. Joana works on biodiversity data coordination: she works closely with the community, understanding their needs and supporting data structuring and submission to sequencing archives. She is also involved in a number of projects and initiatives working toward FAIR biodiversity genomics data and infrastructure linking. Joana is also the co-lead of the ELIXIR Biodiversity Community.
The European Nucleotide Archive, also known as ENA, is an open platform for the management, sharing, integration and dissemination of sequence data. ENA holds a globally comprehensive sequence record and is the European node of the International Nucleotide Sequence Database Collaboration, together with NCBI in the US and DDBJ in Japan. ENA is managed and operated by the European Bioinformatics Institute, EMBL-EBI.
Let’s get started. In biodiversity research, we always hear about metadata. What exactly does it mean, metadata? So, metadata, I guess we can describe it formally as a set of data that describes and provides context to other data.
So this is probably not a very easy concept, because we’re saying “data” a lot. As an example, if we look into specimen or sample metadata: when you use a specimen to collect information (it could be some morphological information, it could be some biological traits), the metadata would be the information that gives context to the sample, so what species it is, where it was collected, when. That then helps interpret the data. So I guess that’s an easier-to-understand definition of metadata. Yeah,
okay. And give us the 30-second pitch: why should anyone care about good metadata for barcodes or reference genomes?
Yeah, well, I guess if we’re building reference information, if we’re producing barcodes and reference genomes, the whole idea is that the data we’re producing will be used as a reference in other studies in the future, so that we can actually use it. And for this to be useful, we need context, and that’s what good metadata is all about.
So with our barcodes and our reference genomes, we need to include information on what species they belong to, otherwise they’re not very relevant; where we collected them; when; and some biological information about the organism. That will help us contextualize the data and interpret it. And the more relevant metadata we add, the more correct our interpretations will be, and the more useful this will be in the future.
Yeah, if DNA is the book of life, metadata is kind of a sticky note left on the cover. Which three details must always be on that sticky note, so you can trust a dataset?
I like your sticky note imagery, that’s very nice. Okay, so I think the three most important, basic things are: taxonomic identification, because if we have a barcode or a reference genome we need to know the species it refers to, or at least something to place it in the tree of life; location of origin, where it came from, which could be as simple as the country (the more information, the better); and date of collection, when it was collected, so you can put it on a temporal scale. I would say these three are the most important ones.
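To make those three essentials concrete, here is a minimal sketch of such a record in code. The structure and field names are illustrative, loosely inspired by common biodiversity metadata conventions, and are not an actual ENA checklist:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SampleMetadata:
    """The three essentials Joana lists, plus optional context fields (all names illustrative)."""
    taxon: str                  # taxonomic identification, e.g. a binomial name
    country: str                # location of origin; coordinates are better when known
    collection_date: str        # ISO 8601 date, e.g. "2024-05-17"
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    preservation_method: Optional[str] = None  # e.g. "ethanol", "frozen"

# A hypothetical record carrying all three essentials:
sample = SampleMetadata(
    taxon="Erinaceus europaeus",
    country="Portugal",
    collection_date="2024-05-17",
    latitude=41.15,
    longitude=-8.61,
    preservation_method="ethanol",
)
```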
Yeah, it also sounds like metadata are kind of a legacy for future researchers who would like to reuse the data from public repositories. Yes. And nowadays we hear all the time that scientific data and metadata should follow the FAIR principles, meaning they should be Findable, Accessible, Interoperable and Reusable. Could you explain what this means in practice, and how these principles have improved your work? Okay, so this is kind of a long one, I guess, because there are a lot of things,
so maybe let’s break it down a bit. The F in FAIR is all about findability. If we want the data to be findable, we need it to have, again, this rich metadata that we have been talking about, and we need this metadata to be indexed, to be searchable, so that people can find it where the data is stored. Also very important is that we should use persistent identifiers, IDs that we attach to the data or metadata and that allow people to find it more easily: if you use that ID, you will always find it.
Accessibility is all about having communication protocols that are standardized, having data open, allowing for authentication where needed, and making sure that the metadata is accessible; it also connects with findability. Interoperability is more about communication: communication between databases, communication with machines. We need to make sure that the metadata, as it is put in, is readable and understandable by both people and machines, so we need to use languages that are standardized and broadly applicable, and we need to have these links between things. And the last one, reusability, is a little bit of all we have talked about until now: we need rich metadata, we need things to be open, and we also need to track the provenance of the data, which is very important for reusability, and again we need to meet the standards that are built by the community. How does this help my work, or how does this influence my work? I think
the work that I do, which is mostly interacting with the community and making sure that we can help them make their data available, is guided by these principles, and that helps us in this communication. It also helps us understand what we need to do to make things better for the community, and how we can help them actually make their data FAIR. So I think these are guiding principles that we should follow.
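An aside for the reader: persistent identifiers are easy to see in action, because an accession keeps resolving to the same record over time. Below is a minimal sketch in Python, assuming the ENA Browser API’s XML endpoint; the accession shown is purely illustrative:

```python
import requests

def fetch_ena_record(accession: str) -> str:
    """Fetch the public record for an accession from the ENA Browser API.

    Endpoint as documented at https://www.ebi.ac.uk/ena/browser/api/.
    """
    url = f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text  # XML record, including the sample metadata

if __name__ == "__main__":
    # Illustrative sample accession; any valid INSDC accession would work the same way.
    print(fetch_ena_record("SAMEA3545211")[:500])
```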
Thanks a lot, Joana. And talking about communities and serving communities: within BGE we have two communities, and one is the DNA barcoding one. DNA barcoding relies on short reads, while reference genomes run on gigabytes. So how do you manage a single
metadata model that serves both communities? So the communities are different, but they overlap quite a bit. There is metadata that is specific to some of the methods and pipelines used in each community, but a lot of it overlaps. If we look, for example, at sample metadata, all of these analyses rely on specimens, or organisms, or samples, and the differences in this case come more from where the sample originates, and not so much from the type of analysis we’re going to do with it, in terms of short reads or long reads. For example, if a specimen is freshly collected, we have much more information available, because we’re collecting it now. If it comes from a museum, the information we have may be a bit limited, because it was collected in the past. Sometimes we have to adjust the way the metadata is collected so that people are aware that in some cases we have more information available and in other cases less. That’s the kind of exercise we have been doing in this project: trying to map what metadata can be collected and to adjust it to the communities’ specific needs. So I guess you could call it a single metadata model that has some specificities for the different communities, but these are mapped, so we can go from one to the other, I guess.
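One way to picture the shared model Joana describes is a common core with source-specific extensions. This is a hypothetical sketch, not BGE’s actual schema; all class and field names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CoreSample:
    """Fields every specimen shares, whether destined for barcoding or genome assembly."""
    taxon: str
    country: str
    collection_date: Optional[str]  # may be unknown for old material

@dataclass
class FreshSample(CoreSample):
    """Freshly collected material: richer context is available at capture time."""
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    collector: Optional[str] = None

@dataclass
class MuseumSample(CoreSample):
    """Museum material: less field context, but a voucher links back to the specimen."""
    institution: str = ""
    voucher_id: str = ""

# A museum specimen with limited field context but a traceable voucher:
museum = MuseumSample(taxon="Erinaceus europaeus", country="Portugal",
                      collection_date=None, institution="NHMUK", voucher_id="2024.123")
```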
Yeah, it makes sense. You repeated the word “collected” several times, and when you talk about collection I think about people going out into the field, especially in these two communities. When you go to the field, sometimes all you have is a piece of paper and a pencil to collect metadata that has to be extremely precise, so I think common mistakes happen all the time. Could you describe a true metadata nightmare that you experienced, something that happened to you or that you witnessed in the past? And are there any guidelines to avoid such errors? Well, if we’re calling it a nightmare, that’s quite strong.
I guess, thinking about it, one of the worst things that can happen is that you swap IDs between specimens somewhere along the way. I think that’s the worst case, and it has repercussions, because if the data has been submitted, then you will have to cancel some of it or update it. Sometimes it’s difficult to track back where the error was introduced, so these are difficult errors to solve. The way to avoid this type of error is to always be very careful in tracking the data and to be very precise, as you said, in the data collection and also throughout the lab and data analysis processes.
I would also say there are a lot of other smaller and more common errors, with sample coordinates or dates, for example. Some of them can be easily identified; others are very difficult to even recognize as errors, and it needs to be the researcher who actually spots them. Often what we see aren’t exactly errors but difficulties in providing information in the needed format, so that it can be properly interpreted. For example, when referencing a specimen that is kept in a museum, people usually provide the ID, but sometimes the ID alone is not enough, because it is only meaningful within an institution. We have some guidelines on how to provide that, but if people don’t follow them, it becomes difficult to do this linking and for the metadata to actually be useful. So I think there are different levels of errors and different ways of getting around them to end up with good metadata.
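For context, INSDC guidance expresses specimen vouchers as an “institution:collection:specimen_id” triplet (the collection part is optional), so the identifier resolves beyond a single institution. A minimal shape check, with illustrative voucher strings:

```python
import re

# INSDC-style specimen voucher: "institution[:collection]:specimen_id".
# The collection code is optional; the voucher values below are illustrative.
VOUCHER_PATTERN = re.compile(r"^[A-Za-z0-9._-]+(:[A-Za-z0-9._ -]+)?:[A-Za-z0-9._ -]+$")

def is_well_formed_voucher(voucher: str) -> bool:
    """Check only the institution:collection:id shape, not whether the codes exist."""
    return bool(VOUCHER_PATTERN.match(voucher))

print(is_well_formed_voucher("NHMUK:Mamm:2024.123"))  # True  (illustrative)
print(is_well_formed_voucher("2024.123"))             # False (bare ID, no institution)
```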
A question on top of this, as I’m curious about your opinion: do you think that some of these mistakes that we as researchers make happen because, still today, we don’t give metadata the importance it needs in our community? Is the perception of the value of metadata increasing, or do we still have a lot of work to do?
Okay, so I think the mistakes don’t necessarily come from a lack of perceived importance. Some mistakes are just that, mistakes; some may come from a lack of knowledge. You may know that metadata is important, but not necessarily know how to deal with it and how to provide it: one thing is to collect it, another is to put it out there for the public. So I think there’s still a lot of training needed in this sense, and a lot of communication towards researchers. I do think there is increasing awareness of how important metadata is, not only for your own study, because researchers are usually aware of that, but also for the future: you don’t just need to make your data available, you need that context too. So I think awareness is increasing, but there’s still work to be done.
So, talking about the future, and metadata that will make it to the future: imagine that in 50 years somebody reuses the assets you worked on. What do you hope future scientists will be able to do because the metadata was rock solid?
So if my metadata is rock solid, that would be great. And if I’m able to link it appropriately, metadata with data and different datasets, using persistent identifiers, then I would expect first that future scientists will be able to find and access it; that’s the first thing, they would need to find it. Then, because they can understand its context, because it’s very good metadata, they could do a lot of things. They could analyze it with innovative methods, perhaps with AI, very advanced things. They could make predictions, analyze trends. I guess everything is out in the open to be done.
Yeah, I agree. But today many labs still dread data-entry days. Have you discovered one practical trick or tool that turns metadata capture from a chore into a habit, like wizards or electronic field notebooks or something like this?
So I don’t think there are magic tools that make metadata collection completely hassle-free. It is something that requires work, and it can be made simpler or more difficult depending on how we do things. I would say planning ahead is my first advice: understanding what we need to collect or capture, where we will share this information, and what standards we need to use to be able to share it. That’s the first step. And then there are a lot of tools that can help. There are apps that help you capture data in the field and already validate the information against standards, and there are brokering systems, tools or groups of people that help researchers validate this metadata and then make it public. So there are a few things that exist and can be used, but I don’t yet see a single magic tool that does it all.
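The kind of validation those apps and brokering systems perform can be pictured with a small sketch: check required fields, an ISO 8601 date, and coordinate ranges before anything is submitted. The rules below are illustrative, not any specific broker’s checklist:

```python
from datetime import date

def validate_sample(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes these checks."""
    problems = []
    # Required fields (illustrative set, echoing the three essentials above)
    for required in ("taxon", "country", "collection_date"):
        if not record.get(required):
            problems.append(f"missing required field: {required}")
    # ISO 8601 calendar date, e.g. "2024-05-17"
    try:
        date.fromisoformat(record.get("collection_date", ""))
    except ValueError:
        problems.append("collection_date is not an ISO 8601 date")
    # Coordinates, if present, must fall in valid ranges
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is not None and not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if lon is not None and not -180 <= lon <= 180:
        problems.append("longitude out of range")
    return problems

print(validate_sample({"taxon": "Erinaceus europaeus", "country": "Portugal",
                       "collection_date": "2024-05-17", "latitude": 41.15}))
# [] -> passes these illustrative checks
```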
Oh, that’s a pity. Okay, but then, going back to the two communities, ERGA and iBOL Europe:
can you share one good practice that barcoders could learn from genome builders about metadata, and vice versa? Well, I think this one is a bit difficult, because both communities have been evolving together and improving their metadata practices together. But thinking of things that crossed over between the communities: the barcoding community always had a sense of the importance of specimen information and vouchering to link with the barcodes, and I think that was passed on to the genome builders, that this is always very important in terms of referencing. Then the genome community understood early on that they needed help with their metadata and set up brokering systems, tools that help people validate the information after it is collected and before it is shared. I think this is a good example that the barcoding community may follow. So I would say these two things can serve as examples, but the communities are working together and evolving together in this respect.
Fantastic. And now a personal twist: if you could tag one fictional character as metadata manager for a day, who would it be, and why? Anyone is allowed.
Okay, my first thought was Spock from Star Trek. He’s very rational, very thorough, he has a lot of analytical skill, and he’s very committed: if he needs to do something, he does it. So I think he would be a good metadata manager. Then I started thinking a little bit about fairies, because they could do some magic in some of the processes to make them easier. Those would be my suggestions. Okay, thank you very much!
This podcast is brought to you by the Biodiversity Genomics Europe project, funded by the European Union, the Swiss Confederation and the United Kingdom.

Published On: June 19th, 2025