A prototype system has been designed to automate the extraction of bibliographic data (e.g., article title, authors, , affiliation and others) from online biomedical journals to populate the National Library of Medicine’s MEDLINE® database. This paper describes a key module in this system: the labeling module that employs statistics and fuzzy rule-based algorithms to identify segmented zones in an article’s HTML pages as specific bibliographic data. Results from experiments conducted with 1,149 medical articles from forty-seven journal issues are presented.
Jongwoo Kim, Daniel X. Le, George R. Thoma