liuyuqi-dellpc 3 years ago
commit
6ea5d5af16
4 changed files with 347 additions and 0 deletions
  1. .gitattributes (+2 -0)
  2. .gitignore (+5 -0)
  3. README.md (+131 -0)
  4. download.py (+209 -0)

+ 2 - 0
.gitattributes

@@ -0,0 +1,2 @@
+# Auto detect text files and perform LF normalization
+* text=auto

+ 5 - 0
.gitignore

@@ -0,0 +1,5 @@
+*.exe
+*.json
+*.zip
+*/
+*.txt

+ 131 - 0
README.md

@@ -0,0 +1,131 @@
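+# Pedagogy Square (教学立方) Courseware Downloader
+
+A batch download script for courseware on the online teaching platform [Pedagogy Square](https://teaching.applysquare.com), built on **Python** + **ChromeDriver**
+
+> Created: 2020-03-30  
+> Updated: 2020-04-16
+
+## Version update notes (important!)
+
+1. The latest release package was published on the morning of April 16. **If you obtained the code before April 14, please** [**download the script again**](https://github.com/EricZhu-42/PedagogySquare_Downloader/releases/download/v1.2_stable/PedagogySquare_Downloader_20200416.zip).
+2. The script **now supports the latest Pedagogy Square version 4.2** (released **April 10, 2020**) and adds support for courseware folders.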
+
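+## Features
+
+1. **One-click download** of all courseware for every course, quick and convenient
+2. **Downloads courseware, videos, and other content that the platform does not directly offer for download**
+3. **Highly configurable** filtering by course and by file extension
+
+The following is an example screenshot of the console during a run: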
+
+![](./figure/1.png)
+
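+> This program is intended to make it easier for students to download courseware and related teaching materials from the Pedagogy Square platform and to eliminate the repetitive work of downloading files.  
+> Please respect teachers' intellectual property and work. Unless you have the teacher's permission, do not distribute the downloaded files on the internet.  
+> If this program infringes on your rights, please contact the author to have the relevant code removed.  
+
+## Development environment
+
+The environment and third-party module versions used during development are: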
+
+- **Python** = 3.7.4
+
+- **Requests** = 2.22.0
+
+- **Selenium** = 3.141.0
+
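+The browser and WebDriver used are:
+
+- **Chrome** = 75.0.3770.142, with the matching version of ChromeDriver
+
+> **In theory** the script is **compatible** with other Python versions (newer versions are generally fine, and somewhat older ones should also work), but keep compatibility in mind if errors occur.
+>
+> The script was developed against Chrome + ChromeDriver. To use a different browser + WebDriver combination, modify the WebDriver parameters in the script.  
+(Update: **Chromium Edge** + ChromeDriver **works** with this script.)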
+
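+## Usage
+
+### 1. Set up the environment (refer to other tutorials)
+
+1. Install a suitable version of Python
+
+2. Install the required Python modules: Requests and Selenium ([Anaconda](https://www.anaconda.com/) is recommended for managing them)
+
+3. Install Chrome and download the [WebDriver version that matches](https://chromedriver.chromium.org/downloads) your Chrome
+
+> A simple way to install ChromeDriver: download the ChromeDriver build matching your Chrome version from the [mirror site](http://npm.taobao.org/mirrors/chromedriver/) and **place it in the same directory as the script**.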
+
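+### 2. Edit the configuration file
+
+Edit the file `config.json` and fill in your username, password, and other settings
+
+> For how to edit the configuration file, see the section "Minimal configuration".  
+> For a description of each parameter in the configuration file, see the section "Configuration file reference".  
+
+### 3. Run the script
+
+Run `download.py`  
+
+> Note: if downloads are very slow during a run, the connection to the Pedagogy Square site may be unstable; try running the script again.
+
+## Project structure
+
+| File                | Purpose                                 |
+| ------------------- | --------------------------------------- |
+| figure/             | Image files used in this README         |
+| download.py         | Script entry point                      |
+| config.json         | Configuration file for run parameters   |
+| config_example.json | Sample configuration file for reference |
+
+## Minimal configuration
+
+```json
+{
+	"username": "your_username",
+	"password": "your_password",
+	"headless_mode": true,
+	"download_all_ext": true,
+	"download_all_courses": true,
+	"ext_list": [],
+	"ext_expel_list": [],
+	"cid_list": []
+}
+```
+
+Replace `your_username` and `your_password` (**keep the double quotes**) with your **phone number** and **Pedagogy Square login password**; the other parameters need no changes.
+
+> Make sure the JSON file is well-formed; you can use the provided `config_example.json` as a reference.
+
+## Configuration file reference
+
+The parameters in `config.json` are briefly described below: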
+
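+| Parameter            | Type | Meaning                                                        |
+| -------------------- | ---- | -------------------------------------------------------------- |
+| username             | str  | Pedagogy Square login username (usually a phone number)        |
+| password             | str  | Pedagogy Square login password                                 |
+| headless_mode        | bool | Whether to run WebDriver in headless mode (no window is shown) |
+| download_all_ext     | bool | Whether to download files of all types                         |
+| download_all_courses | bool | Whether to download courseware for all courses                 |
+| ext_list             | list | File types to download (e.g. pdf, docx, zip)                   |
+| ext_expel_list       | list | File types to exclude                                          |
+| cid_list             | list | IDs of the courses to download                                 |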
+
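+#### Notes:
+
+1. File-type parameters take priority in this order: `ext_expel_list` > `download_all_ext` > `ext_list`  
+   For example, to download "all file types except zip files", set:
+
+   - `download_all_ext` = `true`
+   - `ext_expel_list` = `["zip"]`
+
+2. The course ID can be found in the URL of the course home page, for example:  
+   ![](./figure/0.png)  
+   The ID of the course shown in the figure is **8261**  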
+
+
+## Copyright
+
+Contact email: zhuxinhao00@gmail.com
+
+This project is open source under the MIT license

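Review note on the file-type priority described in the README: the rule `ext_expel_list` > `download_all_ext` > `ext_list` can be read as a small predicate. The sketch below is an illustration only; the helper name `should_download` is hypothetical and is not part of this commit, although its logic mirrors the filter condition in the download loop of `download.py`.

```python
def should_download(ext, ext_list, ext_expel_list, download_all_ext):
    """Return True if a file with extension `ext` passes the filters.

    Hypothetical helper: same decision as the inline condition in
    download.py, written out in priority order.
    """
    if ext == 'dir':
        # Folders are expanded separately and never downloaded themselves
        return False
    if ext in ext_expel_list:
        # Highest priority: explicitly excluded types are always skipped
        return False
    if download_all_ext:
        # Next: everything that was not excluded is downloaded
        return True
    # Lowest priority: fall back to the allow-list
    return ext in ext_list


# "All file types except zip": exclude zip, keep download_all_ext on
print(should_download('zip', [], ['zip'], True))
print(should_download('pdf', [], ['zip'], True))
```

With these settings the first call rejects the zip file and the second accepts the pdf, which is the behaviour the README note describes.
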
+ 209 - 0
download.py

@@ -0,0 +1,209 @@
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
+'''
+@Author  :   liuyuqi
+@Contact :   liuyuqi.gov@msn.cn
+@Time    :   2020/05/03 19:10:15
+@Version :   1.0
+@License :   Copyright © 2017-2020 liuyuqi. All Rights Reserved.
+@Desc    :   teaching.applysquare.com
+'''
+
+import json
+import logging
+import os
+import re
+import time
+from contextlib import closing
+
+import requests
+from selenium import webdriver
+from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
+
+# Replace characters that are illegal in Windows filenames with spaces
+def filename_filter(name: str) -> str:
+    illegal_list = list(r'/\:*?”"<>|')
+    for char in illegal_list:
+        name = name.replace(char, ' ')
+    return name
+
+def construct_attchment_list(driver, token, pid, uid, cid):
+    attachment_list = list()
+    attachment_info_url = attachment_url_fmt.format(token, pid, 1, uid, cid)
+    driver.get(attachment_info_url)
+    raw_info = re.search(r'\{.*\}', driver.page_source).group(0)
+    info = json.loads(raw_info).get('message')
+    file_num = info.get('count')
+
+    current_page = 1
+    # Add attachment path to attachment_list
+    while len(attachment_list) < file_num:
+        current_url = attachment_url_fmt.format(token, pid, current_page, uid, cid)
+        driver.get(current_url)
+        raw_info  = re.search(r'\{.*\}', driver.page_source).group(0)
+        info = json.loads(raw_info).get('message')
+        attachment_list.extend(info.get('list'))
+        current_page += 1
+    return attachment_list
+
+# Load config from config.json
+with open('config.json', 'r') as f:
+    config = json.loads(f.read())
+
+user_name = config.get('username')
+user_passwd = config.get('password')
+headless_mode = config.get('headless_mode')
+download_all_ext = config.get('download_all_ext')
+download_all_courses = config.get('download_all_courses')
+ext_list = config.get('ext_list')
+ext_expel_list = config.get('ext_expel_list')
+cid_list = config.get('cid_list')
+
+# auto_restart = True
+# speed_threshold = 50 * 1024
+
+# Some metadata
+login_url = r"https://teaching.applysquare.com/Home/User/login"
+attachment_url_fmt = r'https://teaching.applysquare.com/Api/CourseAttachment/getList/token/{}?parent_id={}&page={}&plan_id=-1&uid={}&cid={}'
+course_info_url_fmt = r'https://teaching.applysquare.com/Api/Public/getIndexCourseList/token/{}?type=1&usertype=1&uid={}'
+token_pattern = r'(https://teaching\.applysquare\.com/Api/Public/getIndexCourseList/token/.*?)"'
+
+# Start the webdriver
+caps = DesiredCapabilities.CHROME
+caps['loggingPrefs'] = {'performance': 'ALL'}
+opt = webdriver.ChromeOptions()
+opt.add_experimental_option('w3c', False)
+opt.add_argument('log-level=3')
+if headless_mode:
+    opt.add_argument("--headless")
+driver = webdriver.Chrome(options=opt, desired_capabilities=caps)
+
+# Login to Pedagogy Square
+driver.get(login_url)
+time.sleep(1)
+
+driver.find_element_by_xpath(r"/html/body/div[2]/div/div[2]/div/div/div/div/div[2]/div/div/div[1]/input").send_keys(user_name) # Send username
+driver.find_element_by_xpath(r'//*[@id="id_login_password"]').send_keys(user_passwd) # Send password
+driver.find_element_by_xpath(r'//*[@id="id_login_button"]').click() # Submit
+time.sleep(0.5)
+
+# Dealing with student-teacher selection
+try:
+    driver.find_element_by_xpath(r'/html/body/div[2]/div/div[2]/div/div/div[1]/div[2]/div[2]/div[1]/i').click() # Choose student
+    driver.find_element_by_xpath(r'/html/body/div[2]/div/div[2]/div/div/div[1]/div[4]/a').click() # Submit
+except Exception:
+    pass
+
+time.sleep(0.5)
+if (driver.current_url == r'https://teaching.applysquare.com/S/Index/index'):
+    print("Login Successfully!")
+else:
+    print("Login Error --- Please check your username & password")
+    print("Disable headless mode for detailed information")
+
+# Get token for authorization
+token = None
+while not token:
+    for entry in driver.get_log('performance'):
+        match_obj = re.search(token_pattern, entry.get('message'))
+        if match_obj:
+            temp_url = match_obj.group(1)
+            token = re.search(r'token/(.*?)\?', temp_url).group(1)
+            uid = re.search(r'uid=([^&]*)', temp_url).group(1)  # uid runs to the next '&' or the end of the URL
+            break
+
+cid2name_dict = dict()
+course_info_url = course_info_url_fmt.format(token, uid)
+driver.get(course_info_url)
+raw_info = re.search(r'\{.*\}', driver.page_source).group(0)
+info = json.loads(raw_info).get('message')
+for entry in info:
+    cid2name_dict[entry.get('cid')] = entry.get('name')
+
+if download_all_courses:
+    cid_list = cid2name_dict.keys()
+
+for cid in cid_list:
+    cid = str(cid) # Prevent bug caused by wrong type of cid
+    course_name = filename_filter(cid2name_dict[cid])
+    print("\nDownloading files of course {}".format(course_name))
+
+    # Create dir for this course
+    try:
+        os.chdir("./{}".format(course_name))
+    except FileNotFoundError:
+        os.mkdir("{}".format(course_name))
+        os.chdir("./{}".format(course_name))
+
+    # Construct attachment list, with some dirs in it
+    course_attachment_list = construct_attchment_list(driver=driver, token=token, pid=0, uid=uid, cid=cid)
+
+    # Iteratively add files in dirs to global attachment list
+    dir_counter = 0
+    for entry in course_attachment_list:
+        if (entry.get('ext') == 'dir'):
+            dir_counter += 1
+            # Add dir content to attachment list
+            dir_id = entry.get('id')
+            course_attachment_list.extend(construct_attchment_list(driver=driver, token=token, pid=dir_id, uid=uid, cid=cid))
+
+    print("Get {:d} files, with {:d} dirs".format(len(course_attachment_list)-dir_counter, dir_counter))
+
+    # Download attachments
+    for entry in course_attachment_list:
+        ext = entry.get('ext')
+        if (ext == 'dir') or (ext in ext_expel_list) or (not download_all_ext and ext not in ext_list):
+            continue
+
+        if entry.get('title').endswith('.' + ext):
+            filename = filename_filter(entry.get('title'))
+        else:
+            filename = filename_filter("{}.{}".format(entry.get('title'), ext))
+
+        filesize = entry.get('size')
+
+        with closing(requests.get(entry.get('path').replace('amp;', ''), stream=True)) as res:
+            content_size = int(res.headers['content-length'])
+
+            if os.path.exists(filename):
+                # If file is up-to date, continue; else, delete and re-download
+                if os.path.getsize(filename) == content_size:
+                    print("File \"{}\" is up-to-date".format(filename))
+                    continue
+                else:
+                    print("Updating File {}".format(filename))
+                    os.remove(filename)
+
+            print("Downloading {}, filesize = {}".format(filename, filesize))
+            chunk_size = min(content_size, 10240)
+            with open(filename, "wb") as f:
+                chunk_count = 0
+                start_time = time.time()
+                # previous_time = time.time()
+                # lag_counter = 0
+                total = content_size / 1024 / 1024
+                for data in res.iter_content(chunk_size=chunk_size):
+                    chunk_count += 1
+                    processed = min(chunk_count * chunk_size, content_size) / 1024 / 1024
+                    current_time = time.time()
+                    if chunk_count < 5:
+                        print(r"    Total: {:.2f} MB  Processed: {:.2f} MB ({:.2f}%)".format(total, processed, processed/total*100), end = '\r')
+                    else:
+                        remaining = (current_time-start_time)/processed*(total-processed)
+                        print(r"    Total: {:.2f} MB  Processed: {:.2f} MB ({:.2f}%), ETA {:.2f}s".format(total, processed, processed/total*100, remaining), end = '\r')
+                    f.write(data)
+
+                    # speed = chunk_size / 1.0 * (current_time - previous_time)
+                    # if speed < speed_threshold:
+                    #     lag_counter += 1
+                    # else:
+                    #     lag_counter = 0
+
+                    # if lag_counter > 10:
+                    #     print("Restart downloading of file {}".format(filename))
+                    #     attachment_list.append(entry)
+                    #     continue
+
+    os.chdir(r'../') # Switch directory
+
+print("Done!")
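
Review note on `construct_attchment_list`: its `while len(attachment_list) < file_num` loop relies on the server eventually returning `count` items, so it would keep requesting pages if a page came back empty. A minimal sketch of the same pagination pattern with an explicit stop condition follows. `collect_all_pages`, `fetch_page`, and `max_pages` are hypothetical names for illustration; `fetch_page(page)` is assumed to return the decoded `message` dict with the `count` and `list` keys that the script already reads.

```python
def collect_all_pages(fetch_page, max_pages=1000):
    """Collect items from a paginated listing.

    fetch_page(page) is a hypothetical callable returning a dict with
    'count' (total number of items) and 'list' (items on that page).
    """
    items = []
    total = None
    for page in range(1, max_pages + 1):
        message = fetch_page(page)
        if total is None:
            # The total is read once, from the first page
            total = message.get('count', 0)
        batch = message.get('list') or []
        items.extend(batch)
        # Stop when everything is collected or the server has no more items
        if len(items) >= total or not batch:
            break
    return items


# Stand-in for the real API: three items spread over two pages
pages = {1: ['a.pdf', 'b.pdf'], 2: ['c.zip']}

def fake_fetch(page):
    return {'count': 3, 'list': pages.get(page, [])}

print(collect_all_pages(fake_fetch))
```

In the script, the `fetch_page` role would be played by the `driver.get` call plus JSON parsing that `construct_attchment_list` already performs for each page.
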