Representative sequences

In social sciences and other domains, representative sequences are whole sequences that best characterize or summarize a set of sequences.[1] In bioinformatics, representative sequences also designate substrings of a sequence that characterize the sequence.[2][3]

Social sciences

edit
 
Representative sequences covering 27% of 2000 cohabitation sequences between age 15 and 30 (extract of biographical data from the Swiss Household Panel)

In Sequence analysis in social sciences, representative sequences are used to summarize sets of sequences describing for example the family life course or professional career of several thousands individuals.[4]

The identification of representative sequences[1][4] proceeds from the pairwise dissimilarities between sequences. One typical solution is the medoid sequence, i.e., the observed sequence that minimizes the sum of its distances to all other sequences in the set. An other solution is the densest observed sequence, i.e., the sequence with the greatest number of other sequences in its neighborhood. When the diversity of the sequences is large, a single representative is often insufficient to efficiently characterize the set. In such cases, an as small as possible set of representative sequences covering (i.e., which includes in at least one neighborhood of a representative) a given percentage of all sequences is searched.

A solution also considered is to select the medoids of relative frequency groups. More specifically, the method consists in sorting the sequences (for example, according to the first principal coordinate of the pairwise dissimilarity matrix), splitting the sorted list into equal sized groups (called relative frequency groups), and selecting the medoids of the equal sized groups.[5]

The methods for identifying representative sequences described above have been implemented in the R package TraMineR.[6]

Bioinformatics

edit

Representative sequences are short regions within protein sequences that can be used to approximate the evolutionary relationships of those proteins, or the organisms from which they come. Representative sequences are contiguous subsequences (typically 300 residues) from ubiquitous, conserved proteins, such that each orthologous family of representative sequences taken alone gives a distance matrix in close agreement with the consensus matrix.[7]

Protein sequences can provide data about the biological function and evolution of proteins and protein domains. Grouping and interrelating protein sequences can therefore provide information about both human biological processes, and the evolutionary development of biological processes on earth; such sequence clusters allow for the effective coverage of sequence space. Sequence clusters can reduce a large database of sequences to a smaller set of sequence representatives, each of which should represent its cluster at the sequence level. Sequence representatives allow the effective coverage of the original database with fewer sequences. The database of sequence representatives is called non-redundant, as similar (or redundant) sequences have been removed at a certain similarity threshold.

See also

edit

Sequence analysis in social sciences

Sequence analysis in bioinformatics

References

edit
  1. ^ a b Gabadinho, Alexis; Ritschard, Gilbert; Studer, Matthias; Müller, Nicolas S. (2011), Fred, Ana; Dietz, Jan L. G.; Liu, Kecheng; Filipe, Joaquim (eds.), "Extracting and Rendering Representative Sequences", Knowledge Discovery, Knowledge Engineering and Knowledge Management, Communications in Computer and Information Science, vol. 128, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 94–106, doi:10.1007/978-3-642-19032-2_7, ISBN 978-3-642-19031-5, retrieved 2023-06-12
  2. ^ Kuri-Morales, Angel F.; Ortiz-Posadas, Martha R. (2005), Gelbukh, Alexander; de Albornoz, Álvaro; Terashima-Marín, Hugo (eds.), "A New Approach to Sequence Representation of Proteins in Bioinformatics", MICAI 2005: Advances in Artificial Intelligence, vol. 3789, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 880–889, doi:10.1007/11579427_90, ISBN 978-3-540-29896-0, retrieved 2023-06-12
  3. ^ Chen, William L.; Leland, Burton A.; Durant, Joseph L.; Grier, David L.; Christie, Bradley D.; Nourse, James G.; Taylor, Keith T. (2011-09-26). "Self-Contained Sequence Representation: Bridging the Gap between Bioinformatics and Cheminformatics". Journal of Chemical Information and Modeling. 51 (9): 2186–2208. doi:10.1021/ci2001988. ISSN 1549-9596. PMID 21800899.
  4. ^ a b Gabadinho, Alexis; Ritschard, Gilbert (2013). Levy, René; Widmer, Eric D. (eds.). "Searching for typical life trajectories, applied to childbirth histories". Gendered Life Courses, Between Standardization and Individualization: A European Approach Applied to Switzerland. Zurich: LIT: 287–312.
  5. ^ Fasang, Anette Eva; Liao, Tim Futing (2014). "Visualizing Sequences in the Social Sciences: Relative Frequency Sequence Plots". Sociological Methods & Research. 43 (4): 643–676. doi:10.1177/0049124113506563. hdl:10419/209702. ISSN 0049-1241. S2CID 61487252.
  6. ^ Gabadinho, Alexis; Ritschard, Gilbert; Müller, Nicolas S.; Studer, Matthias (2011). "Analyzing and Visualizing State Sequences in R with TraMineR". Journal of Statistical Software. 40 (4). doi:10.18637/jss.v040.i04. ISSN 1548-7660.
  7. ^ Bern, Marshall; Goldberg, David (November 2, 2004). "Automatic selection of representative proteins for bacterial phylogeny". BMC Evolutionary Biology. 5 (34): 34. doi:10.1186/1471-2148-5-34. PMC 1175084. PMID 15927057.