PAKDD
2009
ACM

Scalable Web Mining with Newistic

Download www.horatiumocian.com

Abstract. Newistic is a web mining platform that collects and analyses documents crawled from the Internet. Although it currently processes news articles, it can be easily adapted to any other form of text. Data mining functions performed by the system are categorization, clustering and named entity extraction. The main design concern of the system is scalability, which is achieved by a modular architecture that allows multiple instances of the same component to be run in parallel. This paper presents a novel algorithm for analysing web pages which tries to determine the title and text of a news item directly from the HTML code, discarding noise such as menus, ads, or copyright notices. Another contribution of this paper is the application of the Quality Threshold clustering algorithm for document clustering. Additionally, the algorithm has been optimized to increase its speed.
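The abstract names Quality Threshold (QT) clustering but does not describe the authors' optimized variant. As a rough illustration only, the following is a minimal sketch of the generic, unoptimized QT procedure applied to toy document vectors. The function names (`cosine_distance`, `qt_cluster`), the cosine distance measure, and the `max_diameter` parameter are assumptions made for this example and are not taken from the paper.

```python
def cosine_distance(a, b):
    """Cosine distance between two equal-length term-weight vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def qt_cluster(points, max_diameter, distance=cosine_distance):
    """Generic QT clustering: returns clusters as lists of indices into `points`.

    Each round builds one candidate cluster per remaining point, keeps the
    largest candidate, removes its members, and repeats until nothing is left.
    """
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                # Distance from a point to the farthest current member; adding
                # the point with the smallest such value grows the diameter least.
                def farthest_member(i):
                    return max(distance(points[i], points[j]) for j in candidate)

                nearest = min(others, key=farthest_member)
                if farthest_member(nearest) > max_diameter:
                    break  # any further addition would exceed the threshold
                candidate.append(nearest)
                others.remove(nearest)
            if len(candidate) > len(best):
                best = candidate
        clusters.append(best)
        remaining = [i for i in remaining if i not in best]
    return clusters


# Toy term-weight vectors standing in for documents (hypothetical data).
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
clusters = qt_cluster(docs, max_diameter=0.1)
```

Unlike k-means, this procedure does not need the number of clusters in advance, only a maximum cluster diameter. The naive version shown here recomputes distances repeatedly and is expensive on large collections, which is presumably the kind of cost a speed optimization would target.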

Ovidiu Dan, Horatiu Mocian

Related Content

» Web Mining in Search Engines

» Towards On-Line Analytical Mining in Large Databases

» Mining the Most Interesting Web Access Associations

» A Latent Usage Approach for Clustering Web Transaction and Building User Profile

» Multiple Instance Learning with MultiObjective Genetic Programming for Web Mining

» Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text

» A personalized recommender system based on web usage mining and decision tree induction

» A framework for mining evolving trends in Web data streams using dynamic learning and retr...

» Using retrieval measures to assess similarity in mining dynamic web clickstreams

Post Info

Added	20 May 2010
Updated	20 May 2010
Type	Conference
Year	2009
Where	PAKDD
Authors	Ovidiu Dan, Horatiu Mocian