[Crawling/03] Libraries/Frameworks
Types of libraries/frameworks
-
Anemone (RUBY)
GitHub repository > github.com/chriskite/anemone
Introduction and usage examples > www.rubyinside.com/web-spidering-with-anemone-1927.html
-
nokogiri(RUBY)
Official website > nokogiri.org/
RubyGems > rubygems.org/gems/nokogiri/versions/1.6.8
-
Scrapy(PYTHON)
Official website > scrapy.org/
Tutorial > docs.scrapy.org/en/latest/intro/tutorial.html
-
Jsoup(JAVA)
Official website > jsoup.org/
Cookbook > jsoup.org/cookbook/
Personally, I find it really cute that crawling frameworks/libraries call their tutorials a "Cookbook".
-
beautifulsoup(Python)
Official website > www.crummy.com/software/BeautifulSoup/bs4/doc/
-
crawler4j(JAVA)
GitHub repository > github.com/yasserg/crawler4j
This plays a role similar to Python's Scrapy. Just as Python code often fetches HTML with Scrapy and then parses it with Beautiful Soup, Java code reportedly fetches HTML with crawler4j and parses it with jsoup.
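The fetch-then-parse split described here can be sketched with only the Python standard library. This is a minimal illustration, not how Scrapy, Beautiful Soup, crawler4j, or jsoup actually work internally: `fetch`, `LinkParser`, and `extract_links` are hypothetical names, and `fetch` returns a fixed HTML string instead of downloading anything so the sketch runs offline.

```python
from html.parser import HTMLParser


def fetch(url):
    """Stand-in for the fetch step (the role the post gives Scrapy or crawler4j).

    Assumption: a real crawler would download `url` over HTTP. Here a fixed
    document is returned so the example needs no network access.
    """
    return (
        "<html><body>"
        '<a href="/page1">Page 1</a>'
        '<a href="/page2">Page 2</a>'
        "</body></html>"
    )


class LinkParser(HTMLParser):
    """Stand-in for the parse step (the role of Beautiful Soup or jsoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href value of every <a> tag encountered.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


# Fetch first, then hand the raw HTML to the parser.
links = extract_links(fetch("https://example.com/"))
print(links)
```

Keeping the two steps in separate functions mirrors the division of labor in the pairing above: either half could be swapped for a dedicated tool without touching the other.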
-
Apache Tika (JAVA / can also extract data from many file types besides HTML)
Official website > tika.apache.org/
-
Apache Nutch (JAVA / supports distributed processing)
Official website > nutch.apache.org/
-
node-crawler(Node.js)
npm> www.npmjs.com/package/crawler
GitHub repository > github.com/bda-research/node-crawler
-
gocrawl(Go)
GitHub repository > github.com/PuerkitoBio/gocrawl
Korean translation of README.md > github.com/PuerkitoBio/gocrawl/blob/master/doc/ko/README.md
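Difference between a framework and a library
The difference between a library and a framework comes down to who, or what, holds control over the flow of execution.
A framework owns the overall flow itself, and the user fills in the necessary code inside it; with a library, the user builds the overall flow and calls the library where needed.
A framework is less something you pick up and use than something you step into and work within.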
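The control-flow distinction above can be sketched in a few lines of Python. Everything here is hypothetical and made up for illustration: `parse_title` plays the role of a library function, while `MiniFramework` and `TitleSpider` imitate the shape of a crawling framework without corresponding to any real one's API.

```python
def parse_title(html):
    """Hypothetical library helper: return the text inside <title>...</title>."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return ""
    return html[start + len("<title>"):end]


# Library style: the user writes the loop and calls the helper when needed.
def run_with_library(pages):
    titles = []
    for html in pages:  # the user owns this control flow
        titles.append(parse_title(html))
    return titles


# Framework style: the framework owns the loop and calls the user's code.
class MiniFramework:
    def __init__(self, pages):
        self.pages = pages

    def run(self, spider):
        results = []
        for html in self.pages:  # the framework owns this control flow
            results.append(spider.parse(html))  # callback into user code
        return results


class TitleSpider:
    """User code that is plugged into the framework."""

    def parse(self, html):
        return parse_title(html)


PAGES = [
    "<html><title>A</title></html>",
    "<html><title>B</title></html>",
]

library_result = run_with_library(PAGES)
framework_result = MiniFramework(PAGES).run(TitleSpider())
print(library_result, framework_result)
```

Both calls produce the same titles; the only difference is which side contains the `for` loop, which is the "who holds the flow" question in miniature.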
Reference: https://webclub.tistory.com/458