News telegraph crawler for http://mrdx.cn/


News Telegraph Crawler


Initial version complete. The crawler is single-threaded and polite: it rests for 1-3 s after each request.
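The 1-3 s rest between requests could look roughly like the sketch below. This is not the project's actual code: `polite_pause`, `crawl`, and the `fetch` callback are hypothetical names used only to illustrate a single-threaded loop with a random delay.

```python
import random
import time


def polite_pause(min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Sleep for a random interval between min_s and max_s seconds and return it."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def crawl(urls, fetch, pause=polite_pause):
    """Fetch URLs one at a time (single-threaded), pausing after each request."""
    results = []
    for url in urls:
        results.append(fetch(url))  # fetch is supplied by the caller (assumption)
        pause()
    return results
```

Passing `sleep` and `pause` as parameters keeps the delay easy to shorten or stub out when testing.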

cd my_project_dir
virtualenv -p /opt/python/3.8.5/bin/python3 .venv
source .venv/bin/activate
pip install -r requirements.txt

# method 1
python main.py --start 20230822 --end 20230823

# method 2: configure conf/config.json first
python main.py
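For readers curious how `--start` and `--end` might be interpreted, here is a minimal sketch. It assumes both values are `YYYYMMDD` dates and that the range includes the end day; the real `main.py` may behave differently, and `parse_day`, `days_between`, and `build_parser` are hypothetical helpers.

```python
import argparse
from datetime import datetime, timedelta


def parse_day(text):
    """Parse a YYYYMMDD string such as 20230822 into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


def days_between(start, end):
    """Yield every date from start to end, inclusive (inclusive end is an assumption)."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def build_parser():
    """Build a parser that accepts the two flags shown in method 1."""
    parser = argparse.ArgumentParser(description="crawl_mrdx (illustrative sketch)")
    parser.add_argument("--start", type=parse_day, help="first day, YYYYMMDD")
    parser.add_argument("--end", type=parse_day, help="last day, YYYYMMDD")
    return parser


args = build_parser().parse_args(["--start", "20230822", "--end", "20230823"])
days = [d.strftime("%Y%m%d") for d in days_between(args.start, args.end)]
print(days)  # ['20230822', '20230823']
```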

Packaging on Ubuntu:

pip install pyinstaller
pyinstaller -F -c main.py

Building with Docker:

docker build -t jianboy/crawl_mrdx:v1.0.5 .
docker run -it --rm -v /data/crawl-mrdx:/app jianboy/crawl_mrdx:v1.0.5

Screenshot

Downloaded so far, up to ./data/20130822/07.pdf: 2,275 days of the daily news, 16 GB in total.
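The example path suggests a layout of one directory per day with one PDF per page. A sketch of how such a path could be built follows; it assumes that `07` is a zero-padded page number, which the README does not state, and `pdf_path` is a hypothetical helper.

```python
from datetime import date
from pathlib import PurePosixPath


def pdf_path(day, page, root="data"):
    """Build <root>/<YYYYMMDD>/<NN>.pdf for a given day and page number."""
    return PurePosixPath(root) / day.strftime("%Y%m%d") / f"{page:02d}.pdf"


print(pdf_path(date(2013, 8, 22), 7))  # data/20130822/07.pdf
```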

History

Started in 2015.

python main.py --start 20210822 --end 20220822

python main.py --start 20220822 --end 20230418

After that, PDF editions are no longer generated.

License

Licensed under the Apache License 2.0 © liuyuqi.gov@msn.cn