Logging in to the Gushiwen site (gushiwen.cn) with Python, using the Yunma platform and ddddocr to solve the CAPTCHA
2024-10-31 17:08:18
Location: Sichuan Province

I. CAPTCHA Recognition

1. Using the third-party Yunma platform

Yunma platform: https://www.jfbym.com/

Register an account first. The personal center shows your remaining points and your token (following the platform's WeChat official account grants free points).

In the developer documentation, select Python as the language.

Only the token, type, and image parameters need to be changed:

type is the ID of the CAPTCHA type to solve;

token is found in the personal center; image is the Base64-encoded image to solve.

import base64
import requests
from lxml import etree


def verify(encoded_image):
    """Send a Base64-encoded image to the Yunma API and return the recognized text."""
    url = "http://api.jfbym.com/api/YmServer/customApi"
    data = {
        # There are usually three parameters; other type IDs may need a different
        # number of parameters or different names (ask customer service).
        "token": "Your Token",
        "type": "10110",
        "image": encoded_image,
    }
    _headers = {
        "Content-Type": "application/json"
    }
    response = requests.request("POST", url, headers=_headers, json=data).json()
    return response['data']['data']


url = 'https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# Parse the CAPTCHA image URL from the login page
img_url = 'https://www.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
# Download the CAPTCHA image bytes
img_data = requests.get(url=img_url, headers=headers).content

# The API expects the image as a Base64 string
encoded_image = base64.b64encode(img_data).decode()
print(verify(encoded_image))
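
The verify function above indexes response['data']['data'] directly, which raises an exception if the platform returns an error payload without that field. A minimal sketch of a more defensive extraction follows. The helper name extract_captcha_text is hypothetical, and the only response shape assumed is the nested data.data field that the script above already reads; nothing else about the platform's error format is assumed.

```python
from typing import Any, Optional


def extract_captcha_text(payload: Any) -> Optional[str]:
    """Return payload['data']['data'] as a string, or None if it is absent."""
    if not isinstance(payload, dict):
        return None
    outer = payload.get("data")
    if not isinstance(outer, dict):
        return None
    text = outer.get("data")
    # Treat empty or non-string results as "no answer".
    if isinstance(text, str) and text.strip():
        return text.strip()
    return None
```

Inside verify, the last line could then become return extract_captcha_text(response), leaving the caller to decide what to do when the result is None.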

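2. Using the ddddocr library

Install ddddocr from PyPI, using a mirror inside China to speed up the download (some Python versions may cause installation errors):

pip install ddddocr -i https://pypi.douban.com/simple

Or install ddddocr from the git repository:

git clone https://github.com/sml2h3/ddddocr.git
cd ddddocr
python setup.py install

Once installed, run a quick test: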

import ddddocr                       # import ddddocr
ocr = ddddocr.DdddOcr()              # create the recognizer
with open('20241031_113342_code.jpg', 'rb') as f:     # open the image
    img_bytes = f.read()             # read the image bytes
res = ocr.classification(img_bytes)  # run recognition
print(res)



import ddddocr
import requests
from lxml import etree

url = 'https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# Parse the CAPTCHA image URL from the login page
img_url = 'https://www.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
# Download the CAPTCHA image bytes
img_data = requests.get(url=img_url, headers=headers).content
# Run OCR on the raw image bytes with ddddocr
ocr = ddddocr.DdddOcr()
res = ocr.classification(img_data)
print(res)
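
The scripts above build the CAPTCHA URL by prepending the site root to the src attribute, which only works when src is a root-relative path. As a hedged alternative, the sketch below resolves src against the page URL with the standard library's urljoin, which also copes with relative and absolute values. The function name absolute_captcha_url is hypothetical, and the sample path in the usage note is made up for illustration.

```python
from urllib.parse import urljoin

LOGIN_PAGE = (
    "https://www.gushiwen.cn/user/login.aspx"
    "?from=http://www.gushiwen.cn/user/collect.aspx"
)


def absolute_captcha_url(src: str, page_url: str = LOGIN_PAGE) -> str:
    """Resolve an <img> src attribute against the page it was found on."""
    return urljoin(page_url, src)
```

For example, absolute_captcha_url('/captcha.png') resolves against the site root, while a src that is already a full URL is returned unchanged.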

The code qsbu is recognized, but ddddocr also prints its banner ("欢迎使用ddddocr,本项目专注带动行业内卷***"). Pass show_ad=False to suppress it:

ocr = ddddocr.DdddOcr(show_ad=False)
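
OCR can misread a character, so it may help to sanity-check the result and fetch a fresh image when it looks wrong. The following is a minimal sketch under stated assumptions: the code is assumed to be four ASCII letters or digits (based only on the sample result qsbu), and the names looks_valid and solve_with_retry are hypothetical. The recognizer and the image fetcher are passed in as plain callables, so the sketch does not depend on any particular OCR backend.

```python
import re
from typing import Callable, Optional

# Assumption: codes are four ASCII letters or digits, like the sample "qsbu".
CODE_PATTERN = re.compile(r"[A-Za-z0-9]{4}")


def looks_valid(code: str) -> bool:
    """Check that the recognized text matches the assumed code format."""
    return CODE_PATTERN.fullmatch(code) is not None


def solve_with_retry(fetch_image: Callable[[], bytes],
                     recognize: Callable[[bytes], str],
                     attempts: int = 3) -> Optional[str]:
    """Fetch a fresh image and recognize it until the result looks plausible."""
    for _ in range(attempts):
        code = recognize(fetch_image()).strip()
        if looks_valid(code):
            return code
    return None
```

With ddddocr, recognize could be ocr.classification, and fetch_image a small function that downloads the CAPTCHA bytes again.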

II. Simulated Login

Capture the login request with a packet-capture tool.

Logging in again shows that __VIEWSTATE and __VIEWSTATEGENERATOR change on every request; both values can be parsed from the page's elements.
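
To illustrate how dynamic hidden fields such as __VIEWSTATE can be read from the page, here is a small sketch that collects every hidden input in an HTML string. It deliberately uses the standard library's html.parser rather than the lxml XPath approach used in the article's scripts, so it runs without third-party packages. The class and function names are hypothetical, and the sample values in the usage note are made up.

```python
from html.parser import HTMLParser
from typing import Dict


class HiddenInputParser(HTMLParser):
    """Collect name -> value for every <input type="hidden"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html: str) -> Dict[str, str]:
    """Return all hidden form fields found in the given HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields
```

Calling hidden_fields on a page containing an input with type="hidden", name="__VIEWSTATE", and value="abc123" yields a dict with the key '__VIEWSTATE' mapped to 'abc123', which can be merged into the login form data.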


import requests
import base64
from lxml import etree


def verify(encoded_image):
    """Send a Base64-encoded image to the Yunma API and return the recognized text."""
    url = "http://api.jfbym.com/api/YmServer/customApi"
    data = {
        # There are usually three parameters; other type IDs may need a different
        # number of parameters or different names (ask customer service).
        "token": "Your Token",
        "type": "10110",
        "image": encoded_image,
    }
    _headers = {
        "Content-Type": "application/json"
    }
    response = requests.request("POST", url, headers=_headers, json=data).json()
    return response['data']['data']


url = 'https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# Parse the CAPTCHA image URL from the login page
img_url = 'https://www.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
# Download the CAPTCHA image bytes
img_data = requests.get(url=img_url, headers=headers).content

encoded_image = base64.b64encode(img_data).decode()
# Solve the CAPTCHA once and reuse the result instead of calling the API twice
code = verify(encoded_image)
print(code)

login_url = 'https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx'
data = {
    # xpath() returns a list, so take the first match
    '__VIEWSTATE': tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0],
    '__VIEWSTATEGENERATOR': tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0],
    'from': 'http://www.gushiwen.cn/user/collect.aspx',
    'email': '123456789@qq.com',
    'pwd': '12345678',
    'code': code,
    'denglu': '登录'
}
admin_url = 'https://www.gushiwen.cn/user/collect.aspx'
login_response = requests.post(url=login_url, data=data, headers=headers)
if login_response.status_code == 200:
    print('success')
    admin_response = requests.get(url=admin_url, headers=headers)
    with open('./gushiwenwang.html', 'w', encoding='utf-8') as fp:
        fp.write(admin_response.text)
else:
    print('Login failed, status code:', login_response.status_code)

Scraping the post-login page requires sending the login cookie with the request:

Manual cookie handling: copy the cookie from the browser into the request headers. With it, the post-login page data is scraped successfully.

headers_ = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
'cookie': 'Hm_lvt_9007fab6814e892d3020a64454da5a55=1730338566; HMACCOUNT=FD4F3638578FCE59; login=flase; ASP.NET_SessionId=5szzvii3koeubyqxwdp2s1ad; wsEmail=123456789%40qq.com; ticketStr=207674433%7cgQF98DwAAAAAAAAAAS5odHRwOi8vd2VpeGluLnFxLmNvbS9xLzAyQ1BwSlI0bGVkN2kxb2NyYTFEMTAAAgQMDiNnAwQAjScA; gsw2017user=6605987%7c11ADD12B4E53BFAD07AC176F7D79097B%7c2000%2f1%2f1%7c2000%2f1%2f1; wxopenid=defoaltid; gswZhanghao=123456789%40qq.com; gswEmail=123456789%40qq.com; Hm_lpvt_9007fab6814e892d3020a64454da5a55=1730350618; codeyz=5a5e47bd79cbe52e'
}
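
Instead of pasting the raw cookie string into a header, the string can be split into name/value pairs first. This is a minimal sketch; the helper name parse_cookie_header is hypothetical, and it only handles the plain "name=value; name2=value2" form shown above.

```python
from typing import Dict


def parse_cookie_header(raw: str) -> Dict[str, str]:
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies: Dict[str, str] = {}
    for part in raw.split(";"):
        # partition keeps any further '=' characters inside the value
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```

The resulting dict can be passed to requests through the cookies= argument, for example requests.get(admin_url, headers=headers, cookies=parse_cookie_header(raw_cookie)), where raw_cookie holds the copied string.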

However, the cookie changes between logins, so a hard-coded value soon stops working.

A requests Session object handles cookies automatically:

import requests
import base64
from lxml import etree


def verify(encoded_image):
    """Send a Base64-encoded image to the Yunma API and return the recognized text."""
    url = "http://api.jfbym.com/api/YmServer/customApi"
    data = {
        # There are usually three parameters; other type IDs may need a different
        # number of parameters or different names (ask customer service).
        "token": "Your Token",
        "type": "10110",
        "image": encoded_image,
    }
    _headers = {
        "Content-Type": "application/json"
    }
    response = requests.request("POST", url, headers=_headers, json=data).json()
    return response['data']['data']


url = 'https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'
}
# Use one Session for every request so the cookies set while loading the
# login page and the CAPTCHA are sent back with the login request
session = requests.Session()
page_text = session.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# Parse the CAPTCHA image URL from the login page
img_url = 'https://www.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
# Download the CAPTCHA image bytes
img_data = session.get(url=img_url, headers=headers).content

encoded_image = base64.b64encode(img_data).decode()
# Solve the CAPTCHA once and reuse the result instead of calling the API twice
code = verify(encoded_image)
print(code)

login_url = 'https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx'
data = {
    # xpath() returns a list, so take the first match
    '__VIEWSTATE': tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0],
    '__VIEWSTATEGENERATOR': tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0],
    'from': 'http://www.gushiwen.cn/user/collect.aspx',
    # Placeholder credentials; replace with your own account
    'email': '123456789@qq.com',
    'pwd': '12345678',
    'code': code,
    'denglu': '登录'
}
# Send the login request as form data (data=, not json=)
admin_url = 'https://www.gushiwen.cn/user/collect.aspx'
login_response = session.post(url=login_url, data=data, headers=headers)
print(login_response.cookies)
admin_response = session.get(url=admin_url, headers=headers)
if login_response.status_code == 200:
    print('success')
    with open('./admin.html', 'w', encoding='utf-8') as fp:
        fp.write(admin_response.text)
else:
    print('Login failed, status code:', login_response.status_code)
session.close()

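The script prints the login cookie, and the session sends that cookie with the request for the post-login page, which is scraped successfully.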
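
A 200 status code alone does not prove that the login worked, since a rejected login may still return a normal page. One possible extra check, sketched below, is to look for a login-specific cookie in the session. This rests on an assumption: the cookie name gsw2017user is taken from the browser cookie string shown earlier, and whether it appears only after a successful login has not been verified here. The function name has_login_cookie is hypothetical.

```python
from typing import Iterable

# Assumption: this cookie name is taken from the browser cookie string shown
# earlier; whether it only appears after a successful login is unverified.
LOGIN_COOKIE = "gsw2017user"


def has_login_cookie(cookie_names: Iterable[str], marker: str = LOGIN_COOKIE) -> bool:
    """Return True if the marker cookie is among the given cookie names."""
    return marker in set(cookie_names)
```

With a requests Session, the check could be written as has_login_cookie(session.cookies.keys()) after the login request.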
