liuyuqi-dellpc 24df2046f9 Merge branch 'master' of https://git.yoqi.me/lyq/crawl_mrdx | 1 year ago | |
---|---|---|
.vscode | 1 year ago | |
bin | 2 years ago | |
conf | 1 year ago | |
crawl_mrdx | 1 year ago | |
data | 5 years ago | |
screenshot | 2 years ago | |
shell | 1 year ago | |
test | 5 years ago | |
utils | 5 years ago | |
.dockerignore | 1 year ago | |
.gitignore | 1 year ago | |
506.png | 1 year ago | |
Dockerfile | 1 year ago | |
LICENSE | 2 years ago | |
README.md | 1 year ago | |
docker-compose.debug.yml | 1 year ago | |
docker-compose.yml | 1 year ago | |
main.py | 1 year ago | |
main.spec | 1 year ago | |
main.ui | 1 year ago | |
requirements.txt | 1 year ago | |
resources.qrc | 1 year ago |
Initial version complete: a single-threaded, polite crawler that rests 1-3 s after each request.
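The "polite" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names `polite_sleep` and `crawl` and the injected `fetch` callable are hypothetical, and a uniformly random delay is an assumption (the README only states a 1-3 s rest).

```python
import random
import time


def polite_sleep(min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Pause for a random interval in [min_s, max_s] seconds and return it."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch URLs one at a time (single-threaded), pausing after each one.

    `fetch` is a hypothetical callable supplied by the caller, e.g. a
    function that downloads one page and returns its content.
    """
    results = []
    for url in urls:
        results.append(fetch(url))
        polite_sleep(sleep=sleep)  # rest 1-3 s before the next request
    return results
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in normal use the default `time.sleep` applies.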
cd my_project_dir
virtualenv -p /opt/python/3.8.5/bin/python3 .venv
source .venv/bin/activate
pip install -r requirements.txt
# method 1
python main.py --start 20230822 --end 20230823
# method 2: configure conf/config.json first
python main.py
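For readers wondering how a `--start`/`--end` pair in `YYYYMMDD` form might be turned into a list of days to crawl, here is one possible sketch. It is not taken from `main.py`: the helper names `parse_day`, `iter_days`, and `build_parser` are hypothetical, and treating the end date as inclusive is an assumption.

```python
import argparse
from datetime import datetime, timedelta


def parse_day(text):
    """Parse a YYYYMMDD string such as '20230822' into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


def iter_days(start, end):
    """Yield every date from start to end (end inclusive: an assumption)."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def build_parser():
    """Build a parser accepting the two flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Date-range sketch")
    parser.add_argument("--start", type=parse_day, help="first day, YYYYMMDD")
    parser.add_argument("--end", type=parse_day, help="last day, YYYYMMDD")
    return parser
```

With `--start 20230822 --end 20230823`, `iter_days(args.start, args.end)` would yield two dates, one per issue to download.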
Packaging on Ubuntu:
pip install pyinstaller
pyinstaller -F -c main.py
Packaging with Docker:
docker build -t jianboy/crawl_mrdx:v1.0.5 .
docker run -it --rm -v /data/crawl-mrdx:/app jianboy/crawl_mrdx:v1.0.5
Downloads so far reach ./data/20130822/07.pdf: 2275 days of the daily newspaper, 16 GB in total.
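The example path above suggests a layout of one directory per date with zero-padded page numbers. A small sketch of building such paths follows. The helper `pdf_path` is hypothetical, and the two-digit padding is inferred only from the single example `07.pdf`.

```python
from pathlib import Path


def pdf_path(data_dir, day, page):
    """Return <data_dir>/<YYYYMMDD>/<NN>.pdf for one page of one issue.

    `day` is a YYYYMMDD string and `page` an integer; two-digit padding
    is an assumption based on the example path in the README.
    """
    return Path(data_dir) / day / f"{page:02d}.pdf"
```

For instance, `pdf_path("./data", "20130822", 7)` points at the file named in the example.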
Starting from 2015:
python main.py --start 20210822 --end 20220822
python main.py --start 20220822 --end 20230418
After that date, PDF editions are no longer generated.
Licensed under the Apache 2.0 © liuyuqi.gov@msn.cn