
Automatic Document Classification

also: unsupervised document classification

CategoryOrganizingPrinciples

Motivation

The weakest point of document classification (used, for example, to enable FacetedNavigation, related-content navigation or relevance measures) is that it relies on authors classifying their own documents, so far without any tool support. Doing so fully manually (supervised) has the following drawbacks:
  • authors don't agree on ontologies
  • it's a daunting task
  • ontologies need maintenance themselves
  • they tend to be baffling and sometimes counterintuitive or even artificial for several tasks
  • authors are not interested in ontologies; they want to write
  • the "surprise" factor is low using manual document classification; for the author himself, who bravely classifies his documents, the value in return is very low

Supervised document classification nevertheless has its place when you must be certain about document relations, for example related products on a shopping site.

OK, so let's look at the research literature to see which methods are available for (semi-)automating the task of content classification.

-- MichaelDaum - 03 Jul 2005

Techniques

Publications

Managing Content with Automatic Document Classification

Rafael A. Calvo, Jae-Moon Lee and Xiaobo Li (2004), Journal of Digital Information, pdf

Abstract: News articles and Web directories represent some of the most popular and commonly accessed content on the Web. Information designers normally define categories that model these knowledge domains (i.e. news topics or Web categories) and domain experts assign documents to these categories. The paper describes how machine learning and automatic document classification techniques can be used for managing large numbers of news articles, or Web page descriptions, lightening the load on domain experts. The paper uses two datasets, one with more than 800,000 Reuters news stories and another with over 41,000 Web sites, and classifies them using a Naïve Bayes algorithm, into predefined categories. We discuss the different parameters and design decisions that normally appear when building automatic classifiers, including stemming, stop-words, thresholding, amount of data and approaches for improving performance using the structure in XML documents. The methodology developed would enable Web based applications or workflow systems to manage information more efficiently, i.e. by assigning documents to topics automatically or assisting humans in the process of doing so.

see also: http://www.steptwo.com.au/columntwo/archives/001306.html (online forum)

keywords: naive bayesian
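
To make the ingredients named in the abstract more concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with stop-word removal and a confidence threshold. It is not the paper's implementation: the class name NaiveBayes, the tiny stop-word list, the toy training sentences and the 0.6 threshold are all assumptions chosen for illustration, and stemming is left out to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "is", "as"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop stop-words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


class NaiveBayes:
    """Hypothetical minimal multinomial Naive Bayes with Laplace smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # training documents per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocab = set()                       # all words seen during training

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocab.add(word)

    def posteriors(self, text):
        """Return P(category | text) for every trained category."""
        # Words never seen in training carry no evidence, so skip them.
        words = [w for w in tokenize(text) if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[category].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[category] = score
        # Normalise in log space to avoid underflow on long documents.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}

    def classify(self, text, threshold=0.6):
        """Return the best category, or None if the classifier is not confident."""
        post = self.posteriors(text)
        best = max(post, key=post.get)
        return best if post[best] >= threshold else None


# Toy training data (invented for this sketch).
clf = NaiveBayes()
clf.train("stocks fall as markets react to interest rates", "business")
clf.train("central bank raises interest rates", "business")
clf.train("team wins the cup final", "sports")
clf.train("striker scores in the final match", "sports")

print(clf.classify("bank cuts interest rates"))
print(clf.classify("hello world"))  # no known words, so left unclassified
```

The threshold is what allows the semi-automatic mode mentioned at the end of the abstract: documents the classifier is unsure about come back as None and can be handed to a human instead of being filed automatically.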


Information Retrieval

Second Edition, keith@dcsPLEASENOSPAM.gla.ac.uk

Conferences & Journals

Research Groups

Implementations

Online Demos

Discussion

Harrr, citeseer is currently down. Please follow Melluci's homepage, there are tons of publications available online.

-- MichaelDaum - 03 Jul 2005

Interesting overview.

Glancing over the article I see that automatic classification is gradually getting better. Well, the last time I did some serious research on this was in 1998. The progress is not revolutionary, but it is encouraging.

Describing the content is the most important and most difficult task, but classification is more than describing subject matter. These other aspects can largely be automated too, but they need a different approach.

FacetedNavigation uses all kinds of document attributes: the authors, the first publication date, the latest modification date, the length of the document, and the kind of document. The latter is what we now use webs for: a support topic, a dev topic, a doc topic. Automatically assigning a 'kind of topic' is difficult if the topic contains little text.

So yes, we need automatic classification for subject matter. But still we need other tools to categorize topics in other ways.

-- ArthurClemens - 03 Jul 2005

Good link to the AI::Categorizer documentation: http://search.cpan.org/~kwilliams/AI-Categorizer-0.07/lib/AI/Categorizer.pm

-- PeterNixon - 03 Jul 2005

http://en.wikipedia.org/wiki/Document_classification

-- MichaelDaum - 03 Jul 2005

This work will become part of the upcoming ClassificationPlugin.

-- MichaelDaum - 25 Aug 2008

BasicForm
TopicClassification: BrainstormingIdea
TopicSummary: Resources about Automatic Document Classification: essential readings and how to integrate into a wiki
InterestedParties:

RelatedTopics: WhyWebsAreABadIdea, FacetedNavigation, MultiLevelWikiWebs
Topic revision: r8 - 2008-08-25 - MichaelDaum
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.