An automatic device for training and classifying documents based on N-gram statistics and an automatic method for training and classifying documents based on N-gram statistics therefor
The present invention relates to an apparatus and method for automatically learning documents, and to an apparatus and method for automatically classifying documents using the same. By learning and classifying documents on the basis of n-grams, the invention can automatically learn and classify large volumes of documents on the web.

The apparatus for automatically classifying documents according to the present invention includes: a learning document pool containing a plurality of learning document groups classified by category; a preprocessing unit configured to preprocess each learning document group in the learning document pool; and an n-gram data set pool configured to store the sets of n-gram data learned from the learning document pool through that preprocessing. The apparatus further includes: an automatic document learning unit configured to have the preprocessing unit preprocess a new document into a bigram set when a new document appears that is not identified in the learning document pool; and an automatic document classifying unit configured to compare the bigram set of the new document with the bigram sets in the n-gram data set pool, and to allocate and store the bigram set of the new document in one of the n-gram data sets of that pool.

[Reference numerals] (220) Automatic document classifying unit; (230) Learned n-gram data set (bigram example); (AA) Non-identified document; (BB) Appearance of a new document; (CC) Preprocessing
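The learn-then-classify flow described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the tokenization, the bigram unit, or the comparison measure, so everything below is an assumption made for illustration: bigrams are taken over lowercase word tokens, similarity is the raw count of shared bigrams, and the names `preprocess`, `learn`, and `classify` are hypothetical rather than taken from the patent.

```python
import re


def preprocess(text):
    """Assumed preprocessing: lowercase, split into word tokens,
    and return the set of adjacent word pairs (word bigrams)."""
    tokens = re.findall(r"\w+", text.lower())
    return set(zip(tokens, tokens[1:]))


def learn(document_pool):
    """Build the n-gram data set pool: one bigram set per category,
    from a {category: [document, ...]} learning document pool."""
    ngram_pool = {}
    for category, documents in document_pool.items():
        category_bigrams = ngram_pool.setdefault(category, set())
        for document in documents:
            category_bigrams |= preprocess(document)
    return ngram_pool


def classify(new_document, ngram_pool):
    """Compare the new document's bigram set with each category's set,
    pick the category with the most shared bigrams (an assumed measure),
    and store the new bigrams in that category's data set."""
    bigrams = preprocess(new_document)

    def shared(category):
        return len(bigrams & ngram_pool[category])

    # Ties resolve to the first category in the pool's iteration order.
    best = max(ngram_pool, key=shared)
    ngram_pool[best] |= bigrams
    return best
```

Under these assumptions, `learn` would be run once over the categorized learning documents, and `classify` would be called for each newly appearing document. Because `classify` merges the new bigram set into the chosen category, each classified document also extends the learned data set, which mirrors the "allocate and store" step in the abstract.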