Crawling

[Crawling/03] Libraries/Frameworks

박한결 2021. 3. 22. 12:00

Types of Libraries/Frameworks

  • Anemone (Ruby)

GitHub repository > github.com/chriskite/anemone

 


Introduction and usage examples > www.rubyinside.com/web-spidering-with-anemone-1927.html

 


  • Nokogiri (Ruby)

Official website > nokogiri.org/

 


RubyGems > rubygems.org/gems/nokogiri/versions/1.6.8

 


  • Scrapy (Python)

Official website > scrapy.org/

 


Tutorial > docs.scrapy.org/en/latest/intro/tutorial.html

 


  • jsoup (Java)

Official website > jsoup.org/

 


Cookbook > jsoup.org/cookbook/

 


 

Personally, I find it really cute that crawling frameworks/libraries call their tutorials a "Cookbook".

 

  • Beautiful Soup (Python)

Official documentation > www.crummy.com/software/BeautifulSoup/bs4/doc/

 


  • crawler4j (Java)

GitHub repository > github.com/yasserg/crawler4j

 


crawler4j plays a role similar to Scrapy in Python. Just as Python users fetch HTML with Scrapy and often parse it with Beautiful Soup, Java users reportedly fetch HTML with crawler4j and parse it with jsoup.
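The split described above, one tool that fetches HTML and another that parses it, can be sketched with nothing but the Python standard library. This is only an illustration of the division of labor, not the Scrapy, Beautiful Soup, crawler4j, or jsoup API: `fetch`, `extract_links`, `LinkParser`, and `sample_html` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Fetching step (the crawler's job): download a page as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Parsing step (the parser's job): collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Run the parser over an HTML string and return the links it found."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# The parsing step works on any HTML string, however it was obtained.
# A literal string is used here so the example runs without network access.
sample_html = '<p><a href="/page/1">one</a> <a href="/page/2">two</a></p>'
links = extract_links(sample_html)
print(links)  # ['/page/1', '/page/2']
```

With a real page you would chain the two steps as `extract_links(fetch(url))`. Keeping them separate is what lets a crawler and a parser from different projects be combined.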

 

  • Apache Tika (Java; can extract data from many file types besides HTML)

Official website > tika.apache.org/

 


  • Apache Nutch (Java; supports distributed processing)

Official website > nutch.apache.org/

 


  • node-crawler (Node.js)

npm > www.npmjs.com/package/crawler

 


GitHub repository > github.com/bda-research/node-crawler

 


  • gocrawl (Go)

GitHub repository > github.com/PuerkitoBio/gocrawl

 


Korean translation of README.md > github.com/PuerkitoBio/gocrawl/blob/master/doc/ko/README.md

 


 

The Difference Between Frameworks and Libraries

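The difference between a library and a framework comes down to who, or what, holds control over the flow of execution.

A framework owns the overall flow, and the user fills in the code it needs at the points the framework provides. With a library, the user builds the overall flow and calls into the library wherever it is needed.

In other words, you do not so much pick a framework up and use it as step inside it and work from there.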
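A small sketch can make this contrast concrete. Everything here is invented for illustration: `word_count` stands in for a library function, and `TinyCrawler` is a hypothetical toy framework, not the API of any tool listed above.

```python
pages = ["hello world", "one two three"]


# Library style: our code owns the loop and calls the helper when it wants to.
def word_count(text):
    return len(text.split())


library_results = []
for page in pages:
    library_results.append(word_count(page))


# Framework style: TinyCrawler owns the loop and calls our code back.
class TinyCrawler:
    def __init__(self, pages):
        self.pages = pages

    def parse(self, page):
        # Hook that the user is expected to override.
        raise NotImplementedError

    def run(self):
        # The framework decides when and how often parse() is called.
        return [self.parse(page) for page in self.pages]


class WordCountCrawler(TinyCrawler):
    def parse(self, page):
        return len(page.split())


framework_results = WordCountCrawler(pages).run()

print(library_results)    # [2, 3]
print(framework_results)  # [2, 3]
```

Both versions compute the same thing. The difference is where the `for` loop lives: in our own code in the first case, inside `TinyCrawler.run` in the second.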

 

Reference: https://webclub.tistory.com/458

 
