6 years ago · 6ea5d5af16
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,2 @@
 
				+# Auto detect text files and perform LF normalization
			
 
				+* text=auto
			
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,5 @@
 
				+*.exe
			
 
				+*.json
			
 
				+*.zip
			
 
				+*/
			
 
				+*.txt
			
--- a/README.md
+++ b/README.md
@@ -0,0 +1,131 @@
 
				+# Pedagogy Square (教学立方) Courseware Downloader
			
 
				+
			
 
				+A batch courseware download script for the online teaching platform [Pedagogy Square (教学立方)](https://teaching.applysquare.com), built on **Python** + **ChromeDriver**
			
 
				+
			
 
				+> Created: 2020-03-30  
			
 
				+> Updated: 2020-04-16
			
 
				+
			
 
				+## Release Notes (Important!)
			
 
				+
			
 
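				+1. The latest release package was published on the morning of April 16. **If you obtained the code before April 14, please** [**download the script**](https://github.com/EricZhu-42/PedagogySquare_Downloader/releases/download/v1.2_stable/PedagogySquare_Downloader_20200416.zip) **again**.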
			
 
				+2. The script **now supports the latest Pedagogy Square version 4.2** (released on **April 10, 2020**) and adds support for courseware folders
			
 
				+
			
 
				+## Features
			
 
				+
			
 
				+1. **One-click download** of all courseware for all courses, quickly and conveniently
			
 
				+2. **Can download courseware that is not directly open for download**, as well as videos and other content
			
 
				+3. **Highly configurable** filtering by course and by file extension
			
 
				+
			
 
				+Example console screenshot taken during a run:
			
 
				+
			
 
				+![](./figure/1.png)
			
 
				+
			
 
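				+> This program is intended to make it easier for students to download courseware and related teaching materials from the Pedagogy Square platform, eliminating the repetitive work of downloading files one by one  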
			
 
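				+> Please respect teachers' intellectual property and work. Unless you have the teacher's permission, do not distribute the downloaded files on the Internet  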
			
 
				+> If this program infringes on your rights, please contact the author to have the relevant code removed  
			
 
				+
			
 
				+## Development Environment
			
 
				+
			
 
				+The environment and third-party module versions used during development are:
			
 
				+
			
 
				+- **Python** = 3.7.4
			
 
				+
			
 
				+- **Requests** = 2.22.0
			
 
				+
			
 
				+- **Selenium** = 3.141.0
			
 
				+
			
 
				+The browser and WebDriver used are:
			
 
				+
			
 
				+- **Chrome** = 75.0.3770.142, with the matching version of ChromeDriver
			
 
				+
			
 
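				+> **In theory**, the script is **compatible** with other Python versions (newer ones generally work, and somewhat older ones probably do too), but keep compatibility in mind if errors occur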
			
 
				+>
			
 
				+> The script was developed for Chrome + ChromeDriver; to use another browser + WebDriver combination, modify the WebDriver parameters in the script  
			
 
				+(Update: **Chromium Edge** + ChromeDriver **works** with this script)
			
 
				+
			
 
				+## Usage
			
 
				+
			
 
				+### 1. Set up the environment (refer to other tutorials)
			
 
				+
			
 
				+1. Install the corresponding version of Python
			
 
				+
			
 
				+2. Install the corresponding versions of the Python modules Requests and Selenium ([Anaconda](https://www.anaconda.com/) is recommended for managing them)
			
 
				+
			
 
				+3. Install Chrome and download the [WebDriver matching your Chrome version](https://chromedriver.chromium.org/downloads)
			
 
				+
			
 
				+> A simple way to install ChromeDriver: download the ChromeDriver that matches your Chrome version from the [mirror site](http://npm.taobao.org/mirrors/chromedriver/) and **place it in the same directory as the script**.
			
 
				+
			
 
				+### 2. Edit the configuration file
			
 
				+
			
 
				+Edit `config.json` and fill in your username, password, and other settings
			
 
				+
			
 
				+> For how to edit the configuration file, see the "Minimal Configuration" section  
			
 
				+> For a description of each parameter in the configuration file, see the "Configuration File Reference" section  
			
 
				+
			
 
				+### 3. Run the script
			
 
				+
			
 
				+Run `download.py`  
			
 
				+
			
 
				+> Note: if downloads are very slow during a run, the connection to the Pedagogy Square site may be unstable; try running the script again.
			
 
				+
			
 
				+## Project Structure
			
 
				+
			
 
				+| File                | Purpose                                 |
			
 
				+| ------------------- | ------------------------ |
			
 
				+| figure/             | Image files used in this README         |
			
 
				+| download.py         | Script entry point                      |
			
 
				+| config.json         | Configuration file for run parameters   |
			
 
				+| config_example.json | Sample configuration file for reference |
			
 
				+
			
 
				+## Minimal Configuration
			
 
				+
			
 
				+```json
			
 
				+{
			
 
				+	"username": "your_username",
			
 
				+	"password": "your_password",
			
 
				+	"headless_mode": true,
			
 
				+	"download_all_ext": true,
			
 
				+	"download_all_courses": true,
			
 
				+	"ext_list": [],
			
 
				+	"ext_expel_list": [],
			
 
				+	"cid_list": []
			
 
				+}
			
 
				+```
			
 
				+
			
 
				+Replace `your_username` and `your_password` (**keep the double quotes**) with your **phone number** and **Pedagogy Square login password**; the other parameters need no changes.
			
 
				+
			
 
				+> Make sure the JSON file is well-formed; you can use the provided `config_example.json` as a reference.
			
 
				+
			
 
				+## Configuration File Reference
			
 
				+
			
 
				+The parameters in `config.json` are briefly described below:
			
 
				+
			
 
				+| Parameter            | Type | Meaning                                                                         |
			
 
				+| -------------------- | ---- | --------------------------------------------------- |
			
 
				+| username             | str  | Pedagogy Square login username (usually a phone number)                         |
			
 
				+| password             | str  | Pedagogy Square login password                                                  |
			
 
				+| headless_mode        | bool | Whether to enable WebDriver headless mode (no browser window shown while running) |
			
 
				+| download_all_ext     | bool | Whether to download files of all types                                          |
			
 
				+| download_all_courses | bool | Whether to download courseware for all courses                                  |
			
 
				+| ext_list             | list | File types to download (e.g. pdf, docx, zip)                                    |
			
 
				+| ext_expel_list       | list | File types to exclude                                                           |
			
 
				+| cid_list             | list | IDs of the courses to download                                                  |
			
 
				+
			
 
				+#### Notes:
			
 
				+
			
 
				+1. File-type parameters take priority in this order: `ext_expel_list` > `download_all_ext` > `ext_list`  
			
 
				+   For example, to download "all file types except zip", set the parameters to
			
 
				+
			
 
				+   - `download_all_ext` = `true`
			
 
				+   - `ext_expel_list` = `["zip"]`
			
 
				+
			
 
				+2. The course ID can be found in the URL of the course home page, for example:  
			
 
				+   ![](./figure/0.png)  
			
 
				+   The ID of the course shown in the image is **8261**  
			
 
				+
			
 
				+
			
 
				+## Copyright
			
 
				+
			
 
				+Contact email: zhuxinhao00@gmail.com
			
 
				+
			
 
				+This project is open source under the MIT License
			
--- a/download.py
+++ b/download.py
@@ -0,0 +1,209 @@
 
				+#!/usr/bin/env python
			
 
				+# -*- encoding: utf-8 -*-
			
 
				+'''
			
 
				+@Author  :   liuyuqi
			
 
				+@Contact :   liuyuqi.gov@msn.cn
			
 
				+@Time    :   2020/05/03 19:10:15
			
 
				+@Version :   1.0
			
 
				+@License :   Copyright © 2017-2020 liuyuqi. All Rights Reserved.
			
 
				+@Desc    :   teaching.applysquare.com
			
 
				+'''
			
 
				+
			
 
				+import json
			
 
				+import logging
			
 
				+import os
			
 
				+import re
			
 
				+import time
			
 
				+from contextlib import closing
			
 
				+
			
 
				+import requests
			
 
				+from selenium import webdriver
			
 
				+from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
			
 
				+
			
 
				+# Replace characters that are illegal in Windows filenames with spaces
			
 
				+def filename_filter(name:str):
			
 
				+    illegal_list = list(r'/\:*?”"<>|')  # Raw string avoids the invalid "\:" escape sequence
			
 
				+    for char in illegal_list:
			
 
				+        name = name.replace(char, ' ')
			
 
				+    return name
			
 
				+
			
 
				+def construct_attchment_list(driver, token, pid, uid, cid):
			
 
				+    attachment_list = list()
			
 
				+    attachment_info_url = attachment_url_fmt.format(token, pid, 1, uid, cid)
			
 
				+    driver.get(attachment_info_url)
			
 
				+    raw_info = re.search(r'\{.*\}', driver.page_source).group(0)
			
 
				+    info = json.loads(raw_info).get('message')
			
 
				+    file_num = info.get('count')
			
 
				+
			
 
				+    current_page = 1
			
 
				+    # Add attachment path to attachment_list
			
 
				+    while len(attachment_list) < file_num:
			
 
				+        current_url = attachment_url_fmt.format(token, pid, current_page, uid, cid)
			
 
				+        driver.get(current_url)
			
 
				+        raw_info  = re.search(r'\{.*\}', driver.page_source).group(0)
			
 
				+        info = json.loads(raw_info).get('message')
			
 
				+        attachment_list.extend(info.get('list'))
			
 
				+        current_page += 1
			
 
				+    return attachment_list
			
 
				+
			
 
				+# Load config from config.json
			
 
				+with open('config.json', 'r', encoding='utf-8') as f:  # Assumes config.json is saved as UTF-8
			
 
				+    config = json.loads(f.read())
			
 
				+
			
 
				+user_name = config.get('username')
			
 
				+user_passwd = config.get('password')
			
 
				+headless_mode = config.get('headless_mode')
			
 
				+download_all_ext = config.get('download_all_ext')
			
 
				+download_all_courses = config.get('download_all_courses')
			
 
				+ext_list = config.get('ext_list')
			
 
				+ext_expel_list = config.get('ext_expel_list')
			
 
				+cid_list = config.get('cid_list')
			
 
				+
			
 
				+# auto_restart = True
			
 
				+# speed_threshold = 50 * 1024
			
 
				+
			
 
				+# Some metadata
			
 
				+login_url = r"https://teaching.applysquare.com/Home/User/login"
			
 
				+attachment_url_fmt = r'https://teaching.applysquare.com/Api/CourseAttachment/getList/token/{}?parent_id={}&page={}&plan_id=-1&uid={}&cid={}'
			
 
				+course_info_url_fmt = r'https://teaching.applysquare.com/Api/Public/getIndexCourseList/token/{}?type=1&usertype=1&uid={}'
			
 
				+token_pattern = r'(https://teaching\.applysquare\.com/Api/Public/getIndexCourseList/token/.*?)"'
			
 
				+
			
 
				+# Start the webdriver
			
 
				+caps = DesiredCapabilities.CHROME.copy()  # Copy so the shared class-level dict is not mutated
			
 
				+caps['loggingPrefs'] = {'performance': 'ALL'}
			
 
				+opt = webdriver.ChromeOptions()
			
 
				+opt.add_experimental_option('w3c', False)
			
 
				+opt.add_argument('log-level=3')
			
 
				+if headless_mode:
			
 
				+    opt.add_argument("--headless")
			
 
				+driver = webdriver.Chrome(options=opt, desired_capabilities=caps)
			
 
				+
			
 
				+# Login to Pedagogy Square
			
 
				+driver.get(login_url)
			
 
				+time.sleep(1)
			
 
				+
			
 
				+driver.find_element_by_xpath(r"/html/body/div[2]/div/div[2]/div/div/div/div/div[2]/div/div/div[1]/input").send_keys(user_name) # Send username
			
 
				+driver.find_element_by_xpath(r'//*[@id="id_login_password"]').send_keys(user_passwd) # Send password
			
 
				+driver.find_element_by_xpath(r'//*[@id="id_login_button"]').click() # Submit
			
 
				+time.sleep(0.5)
			
 
				+
			
 
				+# Dealing with student-teacher selection
			
 
				+try:
			
 
				+    driver.find_element_by_xpath(r'/html/body/div[2]/div/div[2]/div/div/div[1]/div[2]/div[2]/div[1]/i').click() # Choose student
			
 
				+    driver.find_element_by_xpath(r'/html/body/div[2]/div/div[2]/div/div/div[1]/div[4]/a').click() # Submit
			
 
				+except Exception:
			
 
				+    pass
			
 
				+
			
 
				+time.sleep(0.5)
			
 
				+if (driver.current_url == r'https://teaching.applysquare.com/S/Index/index'):
			
 
				+    print("Login successful!")
			
 
				+else:
			
 
				+    print("Login Error --- Please check your username & password")
			
 
				+    print("Disable headless mode for detailed information")
			
 
				+
			
 
				+# Get token for authorization
			
 
				+token = None
			
 
				+while not token:
			
 
				+    for entry in driver.get_log('performance'):
			
 
				+        match_obj = re.search(token_pattern, entry.get('message'))
			
 
				+        if match_obj:
			
 
				+            temp_url = match_obj.group(1)
			
 
				+            token = re.search(r'token/(.*?)\?', temp_url).group(1)
			
 
				+            uid = re.search(r'uid=([^&]+)', temp_url).group(1)  # A trailing lazy (.*?) would always match ''
			
 
				+            break
			
 
				+
			
 
				+cid2name_dict = dict()
			
 
				+course_info_url = course_info_url_fmt.format(token, uid)
			
 
				+driver.get(course_info_url)
			
 
				+raw_info = re.search(r'\{.*\}', driver.page_source).group(0)
			
 
				+info = json.loads(raw_info).get('message')
			
 
				+for entry in info:
			
 
				+    cid2name_dict[entry.get('cid')] = entry.get('name')
			
 
				+
			
 
				+if download_all_courses:
			
 
				+    cid_list = cid2name_dict.keys()
			
 
				+
			
 
				+for cid in cid_list:
			
 
				+    cid = str(cid) # Prevent bug caused by wrong type of cid
			
 
				+    course_name = filename_filter(cid2name_dict[cid])
			
 
				+    print("\nDownloading files of course {}".format(course_name))
			
 
				+
			
 
				+    # Create dir for this course
			
 
				+    try:
			
 
				+        os.chdir("./{}".format(course_name))
			
 
				+    except FileNotFoundError:
			
 
				+        os.mkdir("{}".format(course_name))
			
 
				+        os.chdir("./{}".format(course_name))
			
 
				+
			
 
				+    # Construct attachment list, with some dirs in it
			
 
				+    course_attachment_list = construct_attchment_list(driver=driver, token=token, pid=0, uid=uid, cid=cid)
			
 
				+
			
 
				+    # Iteratively add files in dirs to global attachment list
			
 
				+    dir_counter = 0
			
 
				+    for entry in course_attachment_list:
			
 
				+        if (entry.get('ext') == 'dir'):
			
 
				+            dir_counter += 1
			
 
				+            # Add dir content to attachment list
			
 
				+            dir_id = entry.get('id')
			
 
				+            course_attachment_list.extend(construct_attchment_list(driver=driver, token=token, pid=dir_id, uid=uid, cid=cid))
			
 
				+
			
 
				+    print("Get {:d} files, with {:d} dirs".format(len(course_attachment_list)-dir_counter, dir_counter))
			
 
				+
			
 
				+    # Download attachments
			
 
				+    for entry in course_attachment_list:
			
 
				+        ext = entry.get('ext')
			
 
				+        if (ext == 'dir') or (ext in ext_expel_list) or (not download_all_ext and ext not in ext_list):
			
 
				+            continue
			
 
				+
			
 
				+        if entry.get('title').endswith('.' + ext):  # Title already carries the extension
			
 
				+            filename = filename_filter(entry.get('title'))
			
 
				+        else:
			
 
				+            filename = filename_filter("{}.{}".format(entry.get('title'), ext))
			
 
				+
			
 
				+        filesize = entry.get('size')
			
 
				+
			
 
				+        with closing(requests.get(entry.get('path').replace('amp;', ''), stream=True)) as res:
			
 
				+            content_size = int(res.headers['content-length'])  # Parse the header safely instead of eval()
			
 
				+
			
 
				+            if os.path.exists(filename):
			
 
				+                # If the file is up to date, skip it; otherwise delete it and re-download
			
 
				+                if os.path.getsize(filename) == content_size:
			
 
				+                    print("File \"{}\" is up-to-date".format(filename))
			
 
				+                    continue
			
 
				+                else:
			
 
				+                    print("Updating File {}".format(filename))
			
 
				+                    os.remove(filename)
			
 
				+
			
 
				+            print("Downloading {}, filesize = {}".format(filename, filesize))
			
 
				+            chunk_size = min(content_size, 10240)
			
 
				+            with open(filename, "wb") as f:
			
 
				+                chunk_count = 0
			
 
				+                start_time = time.time()
			
 
				+                # previous_time = time.time()
			
 
				+                # lag_counter = 0
			
 
				+                total = content_size / 1024 / 1024
			
 
				+                for data in res.iter_content(chunk_size=chunk_size):
			
 
				+                    chunk_count += 1
			
 
				+                    processed = min(chunk_size * chunk_count, content_size) / 1024 / 1024  # Cap at total; the last chunk may be short
			
 
				+                    current_time = time.time()
			
 
				+                    if chunk_count < 5:
			
 
				+                        print(r"    Total: {:.2f} MB  Processed: {:.2f} MB ({:.2f}%)".format(total, processed, processed/total*100), end = '\r')
			
 
				+                    else:
			
 
				+                        remaining = (current_time-start_time)/processed*(total-processed)
			
 
				+                        print(r"    Total: {:.2f} MB  Processed: {:.2f} MB ({:.2f}%), ETA {:.2f}s".format(total, processed, processed/total*100, remaining), end = '\r')
			
 
				+                    f.write(data)
			
 
				+
			
 
				+                    # speed = chunk_size / 1.0 * (current_time - previous_time)
			
 
				+                    # if speed < speed_threshold:
			
 
				+                    #     lag_counter += 1
			
 
				+                    # else:
			
 
				+                    #     lag_counter = 0
			
 
				+
			
 
				+                    # if lag_counter > 10:
			
 
				+                    #     print("Restart downloading of file {}".format(filename))
			
 
				+                    #     attachment_list.append(entry)
			
 
				+                    #     continue
			
 
				+
			
 
				+    os.chdir(r'../') # Switch directory
			
 
				+
			
 
				+print("Done!")
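
The token lookup in `download.py` scans Chrome performance-log messages for the `getIndexCourseList` request URL and pulls the token and `uid` out of it. A minimal sketch of that extraction follows, assuming the URL has the shape given by `course_info_url_fmt`. The function name `extract_token_and_uid` and the sample values `abc123` and `42` are made up for illustration; real log messages contain much more surrounding JSON.

```python
import re

# Same pattern as token_pattern in download.py: capture the URL up to the closing quote
TOKEN_PATTERN = r'(https://teaching\.applysquare\.com/Api/Public/getIndexCourseList/token/.*?)"'


def extract_token_and_uid(message):
    """Return (token, uid) from one log message, or None if the URL is absent."""
    match = re.search(TOKEN_PATTERN, message)
    if match is None:
        return None
    url = match.group(1)
    token = re.search(r'token/(.*?)\?', url)
    # [^&]+ stops at the next query parameter (or the end of the URL)
    uid = re.search(r'uid=([^&]+)', url)
    if token is None or uid is None:
        return None
    return token.group(1), uid.group(1)


# Hypothetical log fragment; token and uid values are invented
sample = ('{"url":"https://teaching.applysquare.com/Api/Public/'
          'getIndexCourseList/token/abc123?type=1&usertype=1&uid=42"}')

print(extract_token_and_uid(sample))                   # ('abc123', '42')
print(repr(re.search(r'uid=(.*?)', sample).group(1)))  # ''
```

The last line shows why the sketch uses `[^&]+`: a lazy group with nothing after it, as in `uid=(.*?)`, is satisfied by the empty string, so it never captures the ID.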