This paper describes how to use the Java Swing HTMLEditorKit to perform multi-threaded web data mining on the EDGAR system (Electronic Data Gathering, Analysis, and Retrieval system). EDGAR is the means by which the SEC (U.S. Securities and Exchange Commission) automates the collection, validation, indexing, acceptance, and forwarding of submissions. Some entities regulated by the SEC (e.g., publicly traded firms) are required by law to file with it. Our focus is on using EDGAR to get information about company filings. Filings are associated with companies by their Central Index Key (CIK). The CIK is used on the SEC's computer system to identify entities that have filed a disclosure with the SEC. We show how to map a stock ticker symbol into a CIK. The methodology for converting the web data source into internal data structures uses HTML as the input to a context-sensitive parser-callback facility. Screen scraping is a popular means of data mining, but the...
Dougal A. Lyon
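
The parser-callback idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a made-up HTML fragment in which a CIK appears as the text of a link whose href contains "CIK="; the real EDGAR page layout may differ. The class name CikCallbackSketch, the helper extractCiks, and the sample markup are hypothetical. The callback is context-sensitive in the sense that it records text only while the parser is inside a matching anchor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical sketch: names and sample markup are assumptions, not from the paper.
class CikCallbackSketch {

    /** Collects the text of every anchor whose href contains "CIK=". */
    static class CikCallback extends HTMLEditorKit.ParserCallback {
        private final List<String> ciks = new ArrayList<>();
        private boolean inCikLink = false; // parser context: inside a matching <a>?

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                inCikLink = href != null && href.toString().contains("CIK=");
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Text is kept only when the enclosing anchor matched.
            if (inCikLink) {
                ciks.add(new String(data).trim());
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.A) {
                inCikLink = false;
            }
        }

        List<String> getCiks() {
            return ciks;
        }
    }

    /** Parses an HTML string and returns the link texts found in CIK anchors. */
    static List<String> extractCiks(String html) throws IOException {
        CikCallback callback = new CikCallback();
        try (Reader reader = new StringReader(html)) {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(reader, callback, true);
        }
        return callback.getCiks();
    }

    public static void main(String[] args) throws IOException {
        // Made-up fragment standing in for a company lookup result page.
        String html = "<html><body><table><tr>"
                + "<td><a href=\"/example/company?CIK=0000123456\">0000123456</a></td>"
                + "<td>EXAMPLE CORP</td></tr></table></body></html>";
        System.out.println(extractCiks(html));
    }
}
```

Because each call to extractCiks creates its own callback and parser, separate pages could in principle be parsed on separate threads without sharing mutable state. The sketch reads from a string so that it runs without network access; a live version would wrap the response stream of an HTTP request in a Reader instead.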