SeaEagleI / legal_document_analysis


Crawls documents from China Judgments Online (裁判文书网) and performs feature analysis and charge prediction on them.



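Legal Document Data Mining and Analysis

Crawls a subset of the documents on China Judgments Online (裁判文书网), then performs feature analysis and charge prediction.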

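Data Crawling

2,989,757 full-text documents have been crawled so far; the JSON file is about 21.5 GB (as of 2022-06-01 00:00)
Field list, field meanings, and value sets: wenshulist1.js
First 100 full-text samples: sample_100.json
Metadata for all 130 million documents as of 2022-06-25: count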
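
A file of this size generally cannot be loaded into memory in one call, so it is usually read as a stream. The sketch below is a minimal illustration only: it assumes the crawled data is stored as one JSON object per line (JSON Lines), which the README does not state, and the function name `iter_json_lines` is hypothetical. If the file is a single JSON array instead, a streaming parser would be needed.

```python
import json
from typing import Any, Dict, Iterator


def iter_json_lines(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line.

    Assumption: the file holds one JSON object per line. Only the
    current line is held in memory, so the file size does not matter.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            yield json.loads(line)
```

Because the function is a generator, a caller can stop early, for example to inspect only the first few records, without reading the rest of the file.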

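Data Preprocessing

A custom database interface provides efficient local access to the data under low-memory conditions
The judge.db file is 22 GB, is stored in the data directory on the server, and can be accessed through the sqlite3 interface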
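
As a rough illustration of low-memory access through sqlite3, the sketch below iterates over a table in small batches instead of loading every row at once. It is not the repository's actual interface: the function name `iter_documents` and the default table name `documents` are hypothetical, since the README does not describe the schema of judge.db.

```python
import sqlite3
from typing import Iterator


def iter_documents(db_path: str, table: str = "documents",
                   batch_size: int = 1000) -> Iterator[sqlite3.Row]:
    """Yield rows one at a time, fetching them in small batches.

    Assumption: `table` is a placeholder name; replace it with a real
    table from judge.db. Memory use is bounded by `batch_size` rows.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # allow access by column name
    try:
        cur = conn.execute(f'SELECT * FROM "{table}"')
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            yield from rows
    finally:
        conn.close()
```

A caller would loop with `for row in iter_documents("data/judge.db", table=...)`, passing whatever table name the database actually uses.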

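Analysis and Mining

Surface-level feature statistics (collected mainly with the crawler for the site's left-hand menu bar)
Charge prediction for criminal cases (the dataset annotation format and the prediction model follow the COLING 2018 paper from Zhiyuan Liu's group)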
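
Before training a charge-prediction model, it can be useful to check how the charges are distributed across the labeled documents. The sketch below is only an illustration and is not taken from the repository: it assumes each record is a dict with a hypothetical `charges` field holding a list of charge names. The real field names are listed in wenshulist1.js and may differ.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def charge_distribution(records: Iterable[Dict[str, Any]],
                        field: str = "charges") -> List[Tuple[str, int]]:
    """Count how often each charge appears, most frequent first.

    Assumption: `field` is a hypothetical key whose value is a list of
    charge names; records without that key are skipped.
    """
    counts: Counter = Counter()
    for rec in records:
        for charge in rec.get(field, []):
            counts[charge] += 1
    return counts.most_common()
```

The function accepts any iterable, so it can consume records from a generator without holding the whole dataset in memory.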

References

https://github.com/yeyeye777/wenshu_spider (API URLs)
https://gitee.com/Lyong9102/cp_wenshu (simulated cookie generation, JS reverse engineering)
https://blog.csdn.net/feilong_86/article/details/102620316 (JS reverse engineering)
https://blog.csdn.net/weixin_47345503/article/details/118554613 (JS reverse engineering)
https://www.whcsrl.com/blog/1017807 (query splitting)
https://github.com/SeaEagleI/house_price_analysis (multiprocess parallelism)
https://github.com/thunlp/attribute_charge (charge prediction)
