
Scrapling: A Lightweight, Adaptive Web Scraping Tool
Alpha_h4ck · 2024-11-22 15:10:38

About Scrapling

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming other popular tools. Whether you are a beginner or an expert, Scrapling offers powerful features while staying simple to use.

>> from scrapling.default import Fetcher, StealthyFetcher, PlayWrightFetcher

# Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
200
>> products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `auto_match=True`
>> products = page.css('.product', auto_match=True)  # and Scrapling still finds them!

Features

1. Fetch websites however you prefer, with multiple fetcher types;
2. Adaptive scraping that survives site changes, with smart content-based scraping;
3. Fast and memory-efficient, with quick JSON serialization;
4. A powerful navigation API and rich text processing;
5. Automatic selector generation for any element;
6. An API similar to Scrapy/BeautifulSoup.

Requirements

Python 3.8+

Installation

Since the tool is built on Python 3, you first need to install and configure a current version of Python 3 on your local machine.

Installing via pip

pip3 install scrapling

Next, download the browser used by the stealthy fetchers (per the project's documentation, this fetches the Camoufox browser along with BrowserForge fingerprint data):

Windows

camoufox fetch --browserforge

macOS

python3 -m camoufox fetch --browserforge

Linux

python -m camoufox fetch --browserforge

Debian-based distributions

sudo apt install -y libgtk-3-0 libx11-xcb1 libasound2

Arch-based distributions

sudo pacman -S gtk3 libx11 libxcb cairo libasound alsa-lib
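
As a quick post-install sanity check, a minimal sketch like the following (using the `Fetcher` class demonstrated later in this article; the target URL is just a placeholder) should print an HTTP status code:

from scrapling import Fetcher

# Fetch a simple page and print the response status to confirm the install works
page = Fetcher().get('https://example.com')
print(page.status)  # 200 means the basic fetcher is working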

Source Code

Researchers can clone the project's source code locally with the following command:

git clone https://github.com/D4Vinci/Scrapling.git

Usage

Smart Navigation
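
The examples below navigate from a `quote` element. As a minimal setup sketch (assuming the quotes.toscrape.com demo site, which the filter examples later in this article also fetch), it could be obtained like this:

from scrapling import Fetcher

# Fetch the demo page and grab the first quote block to navigate from
page = Fetcher().get('https://quotes.toscrape.com/')
quote = page.css_first('.quote')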

>>> quote.tag
'div'

>>> quote.parent
<data='<div class="col-md-8"> <div class="quote...' parent='<div class="row"> <div class="col-md-8">...'>

>>> quote.parent.tag
'div'

>>> quote.children
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
 <data='<span>by <small class="author" itemprop=...' parent='<div class="quote" itemscope itemtype="h...'>,
 <data='<div class="tags"> Tags: <meta class="ke...' parent='<div class="quote" itemscope itemtype="h...'>]

>>> quote.siblings
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

>>> quote.next  # gets the next element, the same logic applies to `quote.previous`
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>

>>> quote.children.css_first(".author::text")
'Albert Einstein'

>>> quote.has_class('quote')
True

# Generate new selectors for any element
>>> quote.generate_css_selector
'body > div > div:nth-of-type(2) > div > div'

# Test these selectors on your favorite browser or reuse them again in the library's methods!
>>> quote.generate_xpath_selector
'//body/div/div[2]/div/div'

If your use case needs more than just the element's parent, you can iterate over the entire ancestor tree of any element like this:

for ancestor in quote.iterancestors():
    print(ancestor.tag)  # do something with each ancestor, e.g. print its tag

You can also search for a specific ancestor of an element that satisfies a condition: just pass a function that takes an Adaptor object as its argument and returns True if the condition is met, False otherwise, as shown below:

>>> quote.find_ancestor(lambda ancestor: ancestor.has_class('row'))
<data='<div class="row"> <div class="col-md-8">...' parent='<div class="container"> <div class="row...'>
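
The predicate can be any condition on the ancestor. As a small illustrative variation (reusing the `has_class` API shown above; the `container` class is visible in the output just shown), you could climb all the way up to the page-level container:

>>> quote.find_ancestor(lambda ancestor: ancestor.has_class('container'))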

Content-Based Selection and Finding Similar Elements

Elements can be selected by their text content in several ways. Below is a complete example on another website:

>>> page = Fetcher().get('https://books.toscrape.com/index.html')

>>> page.find_by_text('Tipping the Velvet')  # Find the first element whose text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>

>>> page.find_by_text('Tipping the Velvet', first_match=False)  # Get all matches if there are more
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]

>>> page.find_by_regex(r'£[\d\.]+')  # Get the first element whose text content matches my price regex
<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>

>>> page.find_by_regex(r'£[\d\.]+', first_match=False)  # Get all elements that match my price regex
[<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>,
 <data='<p class="price_color">£53.74</p>' parent='<div class="product_price"> <p class="pr...'>,
 <data='<p class="price_color">£50.10</p>' parent='<div class="product_price"> <p class="pr...'>,
 <data='<p class="price_color">£47.82</p>' parent='<div class="product_price"> <p class="pr...'>,
 ...]

Find all elements that are similar to the current element in position and attributes:

# For this case, ignore the 'title' attribute while matching
>>> page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
 <data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,
...]

# You will notice that the number of elements is 19, not 20, because the current element is not included.
>>> len(page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title']))
19

# Get the `href` attribute from all similar elements
>>> [element.attrib['href'] for element in page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])]
['catalogue/a-light-in-the-attic_1000/index.html',
 'catalogue/soumission_998/index.html',
 'catalogue/sharp-objects_997/index.html',
 ...]

To add a bit of complexity, suppose that for some reason we want to use that element as a starting point to scrape the data of all the books:

>>> for product in page.find_by_text('Tipping the Velvet').parent.parent.find_similar():
        print({
            "name": product.css_first('h3 a::text'),
            "price": product.css_first('.price_color').re_first(r'[\d\.]+'),
            "stock": product.css('.availability::text')[-1].clean()
        })
{'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
{'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
{'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
...

Handling Structural Changes

Suppose you are scraping a page with the following structure:

<div class="container">

    <section class="products">

        <article class="product" id="p1">

            <h3>Product 1</h3>

            <p class="description">Description 1</p>

        </article>

        <article class="product" id="p2">

            <h3>Product 2</h3>

            <p class="description">Description 2</p>

        </article>

    </section>

</div>

If you want to scrape the first product, that is, the one whose ID is p1, you would probably write a selector like this:

page.css('#p1')

Then the website owner rolls out a structural change:

<div class="new-container">

    <div class="product-wrapper">

        <section class="products">

            <article class="product new-class" data-id="p1">

                <div class="product-info">

                    <h3>Product 1</h3>

                    <p class="new-description">Description 1</p>

                </div>

            </article>

            <article class="product new-class" data-id="p2">

                <div class="product-info">

                    <h3>Product 2</h3>

                    <p class="new-description">Description 2</p>

                </div>

            </article>

        </section>

    </div>

</div>

The selector will no longer work and your code will need maintenance. This is where Scrapling's auto-matching comes into play: `auto_save=True` stores the element's unique properties at selection time, and `auto_match=True` later uses those saved properties to relocate the closest matching element even after the selector breaks:

from scrapling import Adaptor

# Before the change
page = Adaptor(page_source, url='example.com')
element = page.css('#p1', auto_save=True)

if not element:  # One day the website changes?
    element = page.css('#p1', auto_match=True)  # Scrapling still finds it!
# the rest of the code...
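
In a real scraper, both flags typically live in the same code path across runs. Here is a minimal sketch of that pattern, using the same `Fetcher` class as the examples below (the URL and selector are illustrative only):

from scrapling import Fetcher

page = Fetcher().get('https://example.com/products')

# Normal run: select the elements and save their unique properties
products = page.css('.product', auto_save=True)

# After a redesign the selector may come back empty; retry with
# auto-matching against the previously saved properties
if not products:
    products = page.css('.product', auto_match=True)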

Finding Elements with Filters

>> import re
>> from scrapling import Fetcher

>> page = Fetcher().get('https://quotes.toscrape.com/')

# Find all elements with tag name `div`.
>> page.find_all('div')
[<data='<div class="container"> <div class="row...' parent='<body> <div class="container"> <div clas...'>,
 <data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
...]

# Find all div elements with a class that equals `quote`.
>> page.find_all('div', class_='quote')
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Same as above.
>> page.find_all('div', {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all elements with a class that equals `quote`.
>> page.find_all({'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all div elements with class `quote` that contain a `.text` element whose content includes the word 'world'.
>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]

# Find all elements that have children.
>> page.find_all(lambda element: len(element.children) > 0)
[<data='<html lang="en"><head><meta charset="UTF...'>,
 <data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
 <data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
...]

# Find all elements whose content contains the word 'world'.
>> page.find_all(lambda element: "world" in element.text)
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
 <data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]

# Find all span elements whose text matches the given regex.
>> page.find_all('span', re.compile(r'world'))
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>]

# Find all div and span elements with class 'quote' (no such span elements exist, so only divs are returned).
>> page.find_all(['div', 'span'], {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Mix things up.
>> page.find_all({'itemtype':"http://schema.org/CreativeWork"}, 'div').css('.author::text')
['Albert Einstein',
 'J.K. Rowling',
...]

License

This project is developed and released under the BSD-3-Clause open-source license.

Project Repository

Scrapling: https://github.com/D4Vinci/Scrapling

References

https://camoufox.com/python/installation/#download-the-browser

https://github.com/Vinyzu

https://github.com/daijro/browserforge

https://github.com/daijro/camoufox
