
[Natural Language Processing] Reading Word, PDF, and RSS in Python and Building a Corpus

2021.03.12

1. Install the libraries

    # docx
    pip install python-docx
    # pdf
    pip install pypdf2
    # rss
    pip install feedparser
    # corpus
    # After installing nltk, download the data from www.nltk.org/data.html
    pip install nltk

2. Read a Word document

    import docx

    def read_docx(filename):
        file = docx.Document(filename)
        content = []
        for p in file.paragraphs:
            content.append(p.text)
            # print('paragraph style:', p.style)
        # print('number of paragraphs:', len(file.paragraphs))
        return '\n'.join..
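The excerpt cuts off at the return statement, so the following is only a minimal sketch of how the Word-reading step might look as a whole. It assumes the function returns the paragraph texts joined by newlines; `paragraphs_to_text` is a hypothetical helper that is not part of the original post, split out so the joining logic can be tried without a real .docx file.

```python
from types import SimpleNamespace


def paragraphs_to_text(paragraphs):
    """Join the .text of each paragraph-like object with newlines.

    Hypothetical helper (not in the original post).
    """
    return '\n'.join(p.text for p in paragraphs)


def read_docx(filename):
    """Read a .docx file and return its paragraph text as one string.

    Assumes the truncated original joins paragraphs with newlines.
    """
    import docx  # provided by python-docx; imported lazily

    document = docx.Document(filename)
    return paragraphs_to_text(document.paragraphs)


if __name__ == '__main__':
    # Stand-ins for docx paragraphs: any object with a .text attribute works.
    fake = [SimpleNamespace(text='first'), SimpleNamespace(text='second')]
    print(paragraphs_to_text(fake))
```

With python-docx installed, `read_docx('sample.docx')` (a hypothetical file name) would return the document body as a single string, ready to be written out as a corpus file.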

