Tsinghua Search

Author: Vincent Nahn

This is a fun little side project for my Teatime presentation in my Web Information Retrieval class.

Table of Contents

  • Motivation
  • Technical Implementation
  • Scraping
  • Conclusion

    Motivation

    The link below leads to the official Tsinghua Course Registration website. You can only open it if you have a student account.

    Looking at this URL, we can already see the immense effort put into good user experience. Jokes aside, I first thought the URL was randomly generated and unique to each student. It turned out to be hard-coded in the source code:

    https://zhjwe.cic.tsinghua.edu.cn/xkJxs.vxkJxsXkbBs.do?url=/xkJxs.vxkJxsXkbBs.do&m=main&showtitle=0

    This is the website in all its glory:

    [Image: Official Tsinghua Course Registration Website]

    Every exchange student at Tsinghua has probably run into the same problem while using it: the search function only works for Chinese course titles. To find relevant courses, we therefore have to consult the huge CSV files that our international coordinator sent us by email.

    [Image: Search Problems]

    This project therefore aims at improving the user experience of searching for Tsinghua courses.

    Technical Implementation

    This is a snippet of the page source.

    <script>
      // Handle click events on the left-hand tree
      function hitTree(code,txt,deep,isleaf,url,target){
        var URL =url;
        if(URL=="") return;
        if(target!=""){
          window.open(URL, target);
        }else{
          right.location.href = URL;
        }
      }
    
      // Show the left-hand tree
      function showTree(code){
        var u = "xkJxs.vxkJxsXkbBs.do?m=showTree&p_xnxq=2023-2024-2&showtitle=0&jxs_xkjd=退课";
        if(code) u += "&defaultCode=" + code;
        tree.location.href = u;
      }
      // Initialization function
      function init(){
        window.document.title = "选课进修生选课";
        var hreftmp="";
    
        hreftmp="/xkJxs.vxkJxsJxjhBs.do?m=jxsKkxxSearch&p_xnxq=2023-2024-2&showtitle=0";
    
        if(hreftmp==""){
          hreftmp = "zhjw.do?m=showError&zhjwErrMsg=你没有权限访问该页面";
        }
        showTree();
        initTopLocal();
        right.location.href =hreftmp;
      }
      Event.observe(window, "load", init, false);
    </script>
    

    By entering the URL assigned to `hreftmp` directly into the browser, we can see the iframe on its own:

    [Image: Iframe]
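As a small sketch of how the relative path in `hreftmp` turns into a URL that can be pasted into the browser: this assumes the path is resolved against the registration page URL from the Motivation section, which is how a browser resolves relative links. The constant names are only illustrative.

```python
from urllib.parse import urljoin

# URL of the registration page, as quoted in the Motivation section.
PAGE_URL = (
    "https://zhjwe.cic.tsinghua.edu.cn/xkJxs.vxkJxsXkbBs.do"
    "?url=/xkJxs.vxkJxsXkbBs.do&m=main&showtitle=0"
)

# Value assigned to `hreftmp` in the init() function of the page source.
HREFTMP = "/xkJxs.vxkJxsJxjhBs.do?m=jxsKkxxSearch&p_xnxq=2023-2024-2&showtitle=0"

# Resolve the root-relative path against the page URL, as a browser would.
iframe_url = urljoin(PAGE_URL, HREFTMP)
print(iframe_url)
```

Because `HREFTMP` starts with a slash, only the scheme and host of the page URL are kept; the path and query string come entirely from `hreftmp`.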

    Scraping

    A quick look in the Chrome DevTools reveals all the relevant options for the PUT request.

    [Image: DevTools]
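The request seen in DevTools could be rebuilt in Python roughly as follows. This is only a sketch under several assumptions: the endpoint is taken to be the same `.do` path as the iframe URL, the `m` and `p_xnxq` values are copied from the page source, and the `page` field, the cookie handling, and the function name are hypothetical stand-ins for whatever DevTools actually shows. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the search endpoint is the same path the iframe loads.
SEARCH_URL = "https://zhjwe.cic.tsinghua.edu.cn/xkJxs.vxkJxsJxjhBs.do"


def build_search_request(page: int, session_cookie: str) -> Request:
    """Build (but do not send) a PUT request for one page of results."""
    form = {
        "m": "jxsKkxxSearch",     # taken from the page source
        "p_xnxq": "2023-2024-2",  # semester, taken from the page source
        "page": str(page),        # hypothetical paging field
    }
    return Request(
        SEARCH_URL,
        data=urlencode(form).encode("utf-8"),
        headers={
            # Hypothetical: the logged-in session is assumed to travel in a cookie.
            "Cookie": session_cookie,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="PUT",
    )


request = build_search_request(page=3, session_cookie="PLACEHOLDER")
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` would send it; the real field names should be copied from the DevTools network tab first.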

    The response object is an HTML file with a `gridData` variable in a script tag. The data is then loaded into the HTML table on the client side.

    <!-- Define the data rows; the first column is a checkbox -->
    var gridData = [
      [
        /* "<input type='checkbox' name='p_id' value='org.thcic.zhjw.xkJxs.vo.VxkJxsJxjhBsKey@38717159'>", */
        "Rural Planning towards Coordinated Urban-Rural Development",
        "<a class='mainHref' href='js.vjsKcbBs.do?m=showToXs&p_id=1994990176;00000021' target='_blank'>面向城乡协调的乡村规划</a>",
        "00000021",
        "90",
        "1",
        "建筑学院",
        "<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=1994990176' target='_blank' title='负责教师'>刘健</a>,<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=2016990088' target='_blank' title='非负责教师'>周政旭</a>",
        "5-3(前八周)",
        "",
        ""
      ],
      ...
    ];
    

    The task is therefore to extract the data from the HTML response and convert it into a JSON-compatible format. I use regular expressions (in Python) to solve the problem.

    1. Extract the `gridData` variable from the response text:
      output = re.findall(r"var gridData.+?];", response.text, re.S)
      Output:
      var gridData = [
        [
          /* "<input type='checkbox' name='p_id' value='org.thcic.zhjw.xkJxs.vo.VxkJxsJxjhBsKey@38717159'>", */
          "Rural Planning towards Coordinated Urban-Rural Development",
          "<a class='mainHref' href='js.vjsKcbBs.do?m=showToXs&p_id=1994990176;00000021' target='_blank'>面向城乡协调的乡村规划</a>",
          "00000021",
          "90",
          "1",
          "建筑学院",
          "<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=1994990176' target='_blank' title='负责教师'>刘健</a>,<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=2016990088' target='_blank' title='非负责教师'>周政旭</a>",
          "5-3(前八周)",
          "",
          ""
        ],
        ...
      ];
      

      We have the data but still need to turn it into a JSON object.

    2. Remove the variable declaration:
      data = re.sub(r"var gridData = ", "", output[0])
      Output:
      [
        [
          /* "<input type='checkbox' name='p_id' value='org.thcic.zhjw.xkJxs.vo.VxkJxsJxjhBsKey@38717159'>", */
          "Rural Planning towards Coordinated Urban-Rural Development",
          "<a class='mainHref' href='js.vjsKcbBs.do?m=showToXs&p_id=1994990176;00000021' target='_blank'>面向城乡协调的乡村规划</a>",
          "00000021",
          "90",
          "1",
          "建筑学院",
          "<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=1994990176' target='_blank' title='负责教师'>刘健</a>,<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=2016990088' target='_blank' title='非负责教师'>周政旭</a>",
          "5-3(前八周)",
          "",
          ""
        ],
        ...
      ];
      
    3. Remove the semicolon after the closing bracket:
      data = re.sub("];", "]", data)
      Output:
      [
        [
          /* "<input type='checkbox' name='p_id' value='org.thcic.zhjw.xkJxs.vo.VxkJxsJxjhBsKey@38717159'>", */
          "Rural Planning towards Coordinated Urban-Rural Development",
          "<a class='mainHref' href='js.vjsKcbBs.do?m=showToXs&p_id=1994990176;00000021' target='_blank'>面向城乡协调的乡村规划</a>",
          "00000021",
          "90",
          "1",
          "建筑学院",
          "<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=1994990176' target='_blank' title='负责教师'>刘健</a>,<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=2016990088' target='_blank' title='非负责教师'>周政旭</a>",
          "5-3(前八周)",
          "",
          ""
        ],
        ...
      ]
      
    4. Remove all comments:
      data = re.sub(re.compile(r"(/\*.*?\*/)", re.DOTALL), "", data)
      Output:
      [
        [
          "Rural Planning towards Coordinated Urban-Rural Development",
          "<a class='mainHref' href='js.vjsKcbBs.do?m=showToXs&p_id=1994990176;00000021' target='_blank'>面向城乡协调的乡村规划</a>",
          "00000021",
          "90",
          "1",
          "建筑学院",
          "<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=1994990176' target='_blank' title='负责教师'>刘健</a>,<a class='mainHref' href='xkJxs.vxkJxsJxjhBs.do?m=showJsDetail&p_jsh=2016990088' target='_blank' title='非负责教师'>周政旭</a>",
          "5-3(前八周)",
          "",
          ""
        ],
        ...
      ]
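The steps above can be combined into one small function. This is a minimal sketch, assuming the response text contains exactly one `gridData` assignment and that no string in it contains `];` or `/*`. The function name and the shortened sample HTML are illustrative only, and the asterisks in the comment pattern are escaped so that the expression compiles.

```python
import json
import re


def grid_data_from_html(html: str) -> list:
    """Pull the gridData array out of the response HTML and parse it as JSON."""
    # Extract the variable, from its declaration up to the first "];".
    found = re.findall(r"var gridData.+?];", html, re.S)
    if not found:
        raise ValueError("no gridData variable found in the response")
    # Remove the variable declaration.
    data = re.sub(r"var gridData\s*=\s*", "", found[0])
    # Remove the semicolon after the closing bracket.
    data = re.sub(r"\];", "]", data)
    # Remove all /* ... */ comments, including multi-line ones.
    data = re.sub(r"/\*.*?\*/", "", data, flags=re.DOTALL)
    return json.loads(data)


# Shortened stand-in for the real response, using values from the row shown above.
SAMPLE_HTML = """
<script>
var gridData = [
  [
    /* "<input type='checkbox' name='p_id' value='placeholder'>", */
    "Rural Planning towards Coordinated Urban-Rural Development",
    "00000021",
    "90"
  ]
];
</script>
"""

rows = grid_data_from_html(SAMPLE_HTML)
print(rows[0][0])
```

When writing the rows to a file with `json.dump`, passing `ensure_ascii=False` keeps the Chinese titles readable instead of escaping them.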
      

    After the final step, the data is in JSON format. On tsinghua-search.pages.dev, I include the result as `output.json` in the public directory. Instead of fetching 10 course items each time a user presses _Next Page_, my site sends everything at once, so further client requests become unnecessary. This works because the number of course items is small (fewer than 1,300) and the `output.json` file is only 468 KB, which is smaller than many images on the web (or many JavaScript packages).
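To illustrate why a single download is enough, here is a rough sketch of searching and paginating entirely in memory. It is written in Python for consistency with the scraping code and is not the code running on the site. The function names are hypothetical, and the sample row is a shortened copy of the row shown above.

```python
import re

TAG = re.compile(r"<[^>]+>")


def plain_text(cell: str) -> str:
    """Strip HTML tags such as the <a> wrappers around titles and teachers."""
    return TAG.sub("", cell)


def search_courses(rows, query, page=0, page_size=10):
    """Return one page of rows matching the query, plus the total hit count."""
    needle = query.casefold()
    hits = [
        row for row in rows
        if any(needle in plain_text(cell).casefold() for cell in row)
    ]
    start = page * page_size
    return hits[start:start + page_size], len(hits)


# In practice `rows` would be the parsed content of output.json.
rows = [
    [
        "Rural Planning towards Coordinated Urban-Rural Development",
        "<a class='mainHref' href='js.vjsKcbBs.do?m=showToXs&p_id=1994990176;00000021' target='_blank'>面向城乡协调的乡村规划</a>",
        "00000021",
        "90",
        "1",
        "建筑学院",
    ],
]

page_items, total = search_courses(rows, "rural planning")
print(total, page_items[0][0])
```

Matching against the tag-stripped text means a query hits both the English and the Chinese title but never the markup, and moving to the next page is just a different slice of the same list.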

    Conclusion

    Developing a user-friendly website can significantly enhance the user experience, particularly for essential services like course registration. The Tsinghua course registration website, as demonstrated, presents challenges in usability, especially for international students who struggle with the Chinese-only search functionality. By scraping and reformatting the course data, we can provide a more accessible and efficient search experience. This project highlights the importance of accessible design and responsive user interfaces. A well-designed UI/UX is crucial for retaining users and ensuring they can perform necessary tasks efficiently. While users may tolerate subpar interfaces for mandatory services, improving these interfaces can greatly enhance satisfaction and usability. This project not only addresses a specific problem faced by Tsinghua students but also serves as a reminder of the broader impact that thoughtful web design can have on user engagement and satisfaction.