Data mining with python, R, and vim

Notes on things I wondered about while using python, R, and vim

tm: getting slightly stuck with text processing

Where I got stuck

I needed to preprocess text for language processing, so I used the tm package.
The reference I followed is below.

Basic Text Mining in R

The stemDocument function should output English words in a stemmed form that is easy to count, but it does not.

Reproducing the problem by following the reference

The data is the acq corpus that ships with tm.

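Preprocessing steps
1. Remove punctuation
2. Remove numbers
3. Convert to lowercase
4. Remove words that are not needed for counting, such as prepositions and pronouns (stop words)
5. Stem words (strip plural endings and similar suffixes)
6. Remove extra whitespace and newlines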

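Looking at the result, the documents that were originally of class PlainTextDocument have become class character,
and the stemming step (removing plural endings and the like) has failed.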

library(tm)  
## Loading required package: NLP
# Check the acq data we are going to use
data(acq)  
class(acq[[1]])  
## [1] "PlainTextDocument" "TextDocument"
acq[[1]]$content  
## [1] "Computer Terminal Systems Inc said\nit has completed the sale of 200,000 shares of its common\nstock, and warrants to acquire an additional one mln shares, to\n<Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs.\n    The company said the warrants are exercisable for five\nyears at a purchase price of .125 dlrs per share.\n    Computer Terminal said Sedio also has the right to buy\nadditional shares and increase its total holdings up to 40 pct\nof the Computer Terminal's outstanding common stock under\ncertain circumstances involving change of control at the\ncompany.\n    The company said if the conditions occur the warrants would\nbe exercisable at a price equal to 75 pct of its common stock's\nmarket price at the time, not to exceed 1.50 dlrs per share.\n    Computer Terminal also said it sold the technolgy rights to\nits Dot Matrix impact technology, including any future\nimprovements, to <Woodco Inc> of Houston, Tex. for 200,000\ndlrs. But, it said it would continue to be the exclusive\nworldwide licensee of the technology for Woodco.\n    The company said the moves were part of its reorganization\nplan and would help pay current operation costs and ensure\nproduct delivery.\n    Computer Terminal makes computer generated labels, forms,\ntags and ticket printers and terminals.\n Reuter"
# Remove punctuation
acqr <- tm_map(acq, removePunctuation)

# Remove numbers
acqr <- tm_map(acqr, removeNumbers)

# Convert to lowercase
acqr <- tm_map(acqr, tolower)

# Remove words not needed for counting (prepositions, pronouns, etc.)
acqr <- tm_map(acqr, removeWords, stopwords("english"))

# Stem words (plural forms, etc.)
library(SnowballC)
acqr <- tm_map(acqr, stemDocument)

# Remove extra whitespace and newlines
acqr <- tm_map(acqr, stripWhitespace)

# Check the preprocessing result
class(acqr[[1]])  
## [1] "character"
acqr[[1]]  
## [1] "computer terminal systems inc said completed sale shares common stock warrants acquire additional one mln shares sedio nv lugano switzerland dlrs company said warrants exercisable five years purchase price dlrs per share computer terminal said sedio also right buy additional shares increase total holdings pct computer terminals outstanding common stock certain circumstances involving change control company company said conditions occur warrants exercisable price equal pct common stocks market price time exceed dlrs per share computer terminal also said sold technolgy rights dot matrix impact technology including future improvements woodco inc houston tex dlrs said continue exclusive worldwide licensee technology woodco company said moves part reorganization plan help pay current operation costs ensure product delivery computer terminal makes computer generated labels forms tags ticket printers terminals reut"

Cause

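The cause is not the stemDocument function but the tolower function.
The functions in the tm package are written to be applied to PlainTextDocument objects, so they cause no trouble,
but tolower comes from the base package, and its return value ends up being a plain "character" vector.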

Concrete example

library(tm)  
  
# Create a test document
testtext <- PlainTextDocument("aAa 12 aFoiewjfa")  
class(testtext)  
## [1] "PlainTextDocument" "TextDocument"
# Try removing the numbers
r1 <- removeNumbers(testtext)  
class(r1)  
## [1] "PlainTextDocument" "TextDocument"
r1$content  
## [1] "aAa  aFoiewjfa"
# Try converting to lowercase
  
r2 <- tolower(testtext)  
class(r2)  
## [1] "character"
r2  
## [1] "aaa 12 afoiewjfa"

Solution

Wrapping tolower in content_transformer solves the problem.

data(acq)  
  
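# Remove punctuation
acqr <- tm_map(acq, removePunctuation)

# Remove numbers
acqr <- tm_map(acqr, removeNumbers)

# Convert to lowercase (wrapped so the document class is preserved)
acqr <- tm_map(acqr, content_transformer(tolower))

# Remove words not needed for counting (prepositions, pronouns, etc.)
acqr <- tm_map(acqr, removeWords, stopwords("english"))

# Stem words (plural forms, etc.)
library(SnowballC)
acqr <- tm_map(acqr, stemDocument)

# Remove extra whitespace and newlines
acqr <- tm_map(acqr, stripWhitespace)

# Check the preprocessing result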
class(acqr[[1]])  
## [1] "PlainTextDocument" "TextDocument"
acqr[[1]]$content  
## [1] "comput termin system inc said complet sale share common stock warrant acquir addit one mln share sedio nv lugano switzerland dlrs compani said warrant exercis five year purchas price dlrs per share comput termin said sedio also right buy addit share increas total hold pct comput termin outstand common stock certain circumst involv chang control company compani said condit occur warrant exercis price equal pct common stocks market price time exceed dlrs per share comput termin also said sold technolgi right dot matrix impact technolog includ future improv woodco inc houston tex dlrs said continu exclusive worldwid license technolog woodco compani said move part reorganization plan help pay current oper cost ensure product delivery comput termin make comput generat label forms tag ticket printer terminals reuter"
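
To see the fix in isolation, here is a minimal sketch that applies the same wrapper to the single test document from the concrete example. The variable names to_lower_doc and r3 are my own choices for illustration. As far as I can tell, content_transformer returns a function that applies tolower only to the content of the document and then hands back the document itself, which would explain why the class survives.

```r
library(tm)

# Same test document as in the concrete example
testtext <- PlainTextDocument("aAa 12 aFoiewjfa")

# Wrap base::tolower so it modifies the content and returns the document
# (to_lower_doc is a hypothetical name used only in this sketch)
to_lower_doc <- content_transformer(tolower)

r3 <- to_lower_doc(testtext)

# The class should still be PlainTextDocument, not character
class(r3)

# The content itself is lowercased
r3$content
```

The same wrapper should work for other base string functions such as gsub when they are passed to tm_map, since they also return plain character vectors.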