Christopher D. Paice

Christopher D Paice was one of the pioneers of research into stemming. The Paice-Husk stemmer was published in 1990, and his method of evaluating stemmer performance by means of the Error Rate Relative to Truncation (ERRT) was the first direct method of comparing under-stemming and over-stemming errors. Apart from his pioneering work on stemming algorithms and evaluation methods, he made other research contributions in the areas of information retrieval, anaphora resolution and automatic abstracting.[1] [2]

Teaching career

Christopher D Paice was a member of the School of Computing and Communications (SCC) at Lancaster University, United Kingdom, for around forty years. He initially joined the then Department of Computer Studies as a Research Associate in 1969-70, then moved on to a Lectureship. He was acting Head of Department in 1977-78 and Head of Department in 1979-82, and retired in 2009.[3]

The Paice-Husk Stemming Algorithm

The Paice-Husk Stemmer was developed by Chris D Paice with the assistance of Gareth Husk in the Computing Department at Lancaster University in the late 1980s. It features an externally stored set of stemming rules, and this flexibility compared with the Porter stemmer made it of interest to several researchers.[4]

The stemmer was originally implemented in the Pascal programming language; further implementations have been made in ANSI C and Java. A Perl version was implemented by Mary Taffet at the Center for Natural Language Processing at Syracuse University, USA.[5]

The stemmer consists of a stemming algorithm and a separate set of stemming rules. The standard set of rules provides a 'strong' stemmer. Stemmer strength is advantageous for index compression; however, a strong stemmer produces a larger number of overstemming errors relative to the number of understemming errors. Users who need a lighter stemmer can easily develop their own set of rules.

The Stemmer is iterative (i.e. endings are removed piecemeal in an indefinite number of stages) and the rules may specify the removal or replacement of an ending. The replacement technique avoids the need for a separate stage in the process to recode or provide partial matching; this helps maintain the efficiency of the algorithm. The rules are indexed by the last letter of the ending to allow efficient searching.[6]
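
The behaviour described above can be illustrated with a minimal Python sketch. It is not the published Paice-Husk implementation: the rule table, the `Rule` fields, the `min_stem_len` check and the function names are all hypothetical and chosen only to show iterative application of remove-or-replace rules that are looked up by the last letter of the ending.

```python
from collections import defaultdict
from typing import NamedTuple


class Rule(NamedTuple):
    ending: str        # suffix the rule matches
    replacement: str   # text substituted for the suffix ("" means plain removal)
    keep_going: bool   # True: try further rules afterwards; False: stop


# Hypothetical toy rules, NOT the standard Paice-Husk rule set.
TOY_RULES = [
    Rule("ies", "y", False),
    Rule("ness", "", True),
    Rule("ing", "", True),
    Rule("ed", "", True),
    Rule("s", "", True),
]


def build_index(rules):
    """Group rules by the last letter of their ending for fast lookup."""
    index = defaultdict(list)
    for rule in rules:
        index[rule.ending[-1]].append(rule)
    return index


def stem(word, index, min_stem_len=3):
    """Apply rules repeatedly until none fits or a rule says to stop.

    min_stem_len is an assumed safeguard against stripping short words.
    """
    while word:
        for rule in index.get(word[-1], []):
            if not word.endswith(rule.ending):
                continue
            candidate = word[: len(word) - len(rule.ending)] + rule.replacement
            if len(candidate) >= min_stem_len:
                word = candidate
                applied = rule
                break
        else:
            return word  # no rule applied in this pass
        if not applied.keep_going:
            return word
    return word


INDEX = build_index(TOY_RULES)
print(stem("walkings", INDEX))  # two stages: "walkings" -> "walking" -> "walk"
print(stem("ponies", INDEX))    # replacement rule: "ponies" -> "pony"
```

Because the rules live in a plain data table separate from `stem`, a lighter stemmer in this sketch would only need a shorter or more conservative rule list.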

Stemmer Evaluation

Apart from the stemmer itself, Chris Paice developed a method for directly measuring the performance of stemmers. Grouped lists of words are applied to the stemmer and the numbers of overstemming and understemming errors are counted; the results are then compared with what would have been obtained by using a set of truncation stemmers. The final measure is the Error Rate Relative to Truncation (ERRT).[7] [8]
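
The error-counting step can be sketched in Python as follows. This is a simplified illustration, not the published procedure: it assumes that an understemming error is a pair of words from the same group that receive different stems, and an overstemming error is a pair of words from different groups that receive the same stem. It reports raw pair counts only and does not compute the normalised indices or the ERRT value itself. The word groups and the function names `truncate` and `count_errors` are hypothetical.

```python
from itertools import combinations


def truncate(n):
    """Return a truncation stemmer that keeps the first n letters."""
    return lambda word: word[:n]


def count_errors(groups, stemmer):
    """Count (understemming, overstemming) pair errors for a stemmer.

    groups: list of word lists; words in one list should share a stem.
    """
    stems = [[stemmer(w) for w in group] for group in groups]
    # Same group, different stems: the stemmer failed to merge them.
    under = sum(
        1 for group in stems for a, b in combinations(group, 2) if a != b
    )
    # Different groups, same stem: the stemmer merged them wrongly.
    over = sum(
        1
        for g1, g2 in combinations(stems, 2)
        for a in g1
        for b in g2
        if a == b
    )
    return under, over


# Hypothetical grouped word list for illustration.
GROUPS = [
    ["divide", "dividing", "divided"],
    ["divine", "divinity"],
]

print(count_errors(GROUPS, truncate(4)))  # heavy truncation: (0, 6)
print(count_errors(GROUPS, truncate(6)))  # light truncation: (3, 0)
```

The two truncation lengths show the trade-off that the comparison against truncation stemmers relies on: short truncation merges everything and overstems, while long truncation merges little and understems.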

Personal life

Christopher D Paice was born in 1941. He married Kathleen F Moss in 1965 in the Manchester Registration district. In 2015 he was diagnosed with an aggressive brain tumour, shortly after he and his wife moved away from Cumbria to Stratford. He died on 21 April 2016.

Publications

  • C D Paice (1977). Information Retrieval and the Computer. Macdonald and Jane's, London.
  • C D Paice (1980). Proceedings SIGIR '80 The automatic generation of literature abstracts: an approach based on the identification of self-indicating phrases. Butterworth. ISBN 0-408-10775-8.
  • C D Paice (1984). Information Technology Research Development Applications: Volume 3 Issue 1, Soft evaluation of Boolean search queries in information retrieval systems. Butterworth.
  • C D Paice; V. Aragón-Ramírez (1985). RIAO '85: Recherche d'Informations Assistée par Ordinateur, The calculation of similarities between multi-word strings using a thesaurus. LE CENTRE DE HAUTES ETUDES INTERNATIONALES D'INFORMATIQUE DOCUMENTAIRE.
  • C D Paice (1986). ASLIB Proceedings: Volume 38 Issue 10, Expert systems for information retrieval?. Aslib, The Association for Information Management.
  • C D Paice (1990). Information Processing and Management: an International Journal, Volume 26 Issue 1 Constructing literature abstracts by computer: techniques and prospects. Pergamon Press, Inc.
  • C D Paice (1990). Information Processing and Management: an International Journal, Volume 27 Issue 5 A thesaural model of information retrieval. Pergamon Press, Inc.
  • C D Paice (1991). ACM SIGIR Forum: Volume 24 Issue 3 Another stemmer. ACM.
  • F. C. Johnson; C. D. Paice; W. J. Black; A. P. Neal (1997). Readings in information retrieval: The application of linguistic processing to automatic abstract generation. Morgan Kaufmann Publishers Inc.
  • Michael B. Twidale; David M. Nichols; Chris D. Paice (1997). Information Processing and Management: an International Journal: Volume 33 Issue 6, Browsing is a collaborative process. Pergamon Press, Inc.
  • Michael P. Oakes; C. D. Paice (1999). IRSG'99: Proceedings of the 21st Annual BCS-IRSG conference on Information Retrieval Research The automatic generation of templates for automatic abstracting. BCS.
  • C. D. Paice (2009). Lexical Analysis of Textual Data. Encyclopedia of Database Systems. Springer, US. pp. 1606–1610. ISBN 978-0-387-35544-3.
  • C. D. Paice (2009). Stemming. Encyclopedia of Database Systems. Springer, US. pp. 2790–2793. ISBN 978-0-387-35544-3.

References

  1. University Trier, DBLP Computer Science Bibliography
  2. ACM Author page, C D Paice
  3. Lancaster University, In Memory of Chris Paice
  4. Improvements to the Lancaster Stemming Algorithm (Paice-Husk Stemmer), Antonio Zamora
  5. GitHub, Paice-Husk Stemmer in several languages
  6. "Paice/Husk Stemmer". Archived from the original on 2006-08-22. Retrieved 2006-08-22.
  7. Paice, C.D., (1994) An evaluation method for stemming algorithms, in Croft, W.B. & van Rijsbergen, C.J. (eds.), Proceedings of the 17th ACM SIGIR conference held at Dublin, July 3–6, 1994; pp. 42-50.
  8. Paice, C.D. (1996) Method for Evaluation of Stemming Algorithms based on Error Counting, JASIS, 47(8): 632-649