Lookup NU author(s): Dr Michael Bell, Dr Colin Gillespie, Dr Phillip Lord
Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly, time-consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid unannotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this paper. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf’s Principle of Least Effort. We use UniProtKB as a case study to demonstrate this approach since it allows us to compare annotation change, both over time and between automated and manually-curated annotations. By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality. Source code and supplementary data are available at: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation/
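As a rough illustration of what fitting a power law to word reuse can look like, the sketch below counts how often each word occurs in a body of annotation text and estimates a power-law exponent from those counts. It is not the authors' code (which is available at the URL above): the function names `word_counts` and `estimate_alpha`, the whitespace tokenisation, and the choice of estimator are all assumptions made for this example. The estimator is the common closed-form approximation to the maximum-likelihood exponent for discrete data, which tends to be less accurate when the lower cut-off `xmin` is small.

```python
import math
from collections import Counter


def word_counts(text):
    """Count occurrences of each lowercased, whitespace-separated token.

    Hypothetical helper: real annotation text would likely need more
    careful tokenisation (punctuation, identifiers, and so on).
    """
    return Counter(text.lower().split())


def estimate_alpha(frequencies, xmin=1):
    """Approximate maximum-likelihood exponent of a discrete power law.

    Uses alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))) over all
    frequencies x_i >= xmin. This is an approximation, assumed here for
    brevity; it is not necessarily the fitting method used in the report.
    """
    xs = [x for x in frequencies if x >= xmin]
    if not xs:
        raise ValueError("no frequencies at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in xs)
    return 1.0 + len(xs) / log_sum


if __name__ == "__main__":
    # Toy annotation text, invented purely for demonstration.
    sample = "binds atp binds dna catalyses hydrolysis of atp binds zinc"
    counts = word_counts(sample)
    print(counts.most_common(3))
    print(estimate_alpha(counts.values()))
```

Comparing the estimated exponent across database releases, or between manually curated and automatically generated entries, is the kind of comparison the abstract describes; the values from a toy input like the one above carry no meaning on their own.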
Author(s): Bell MJ, Gillespie CS, Swan D, Lord P
Publication type: Report
Publication status: Published
Series Title: School of Computing Science Technical Report Series
Year: 2012
Pages: 7
Print publication date: 01/06/2012
Source Publication Date: June 2012
Report Number: 1336
Institution: School of Computing Science, University of Newcastle upon Tyne
Place Published: Newcastle upon Tyne
URL: http://www.cs.ncl.ac.uk/publications/trs/papers/1336.pdf