Geographical Preferences in Cantopop Lyrics of Albert Leung (林夕)
Digital Humanities Student Project (Fall 2025)
This is a course project for HUMA5630 Digital Humanities
Photo of Albert Leung, provided by the project author
About This Project
Albert Leung (林夕) is arguably the most influential lyricist in Chinese pop music, whose contemporary urban lyrics have shaped timeless classics for renowned artists. His talent for linking abstract emotions to tangible physical spaces motivated our research. Scholars have previously analysed the nature of words, emotional words, and time words in Albert Leung’s lyrics, but there has been no analysis of the specific locations they mention.
This project therefore applies Named Entity Recognition (NER) to analyse geographical imagination in over 1,000 Cantopop lyrics by Albert Leung, investigating how he constructs distinct emotional spaces through geography, specifically the dichotomy between Japan (The Distant/Travel) and Hong Kong (The Domestic/Home).
Methodology
- Prepare training data: Create a training dataset of 40 representative songs by other Hong Kong lyricists that feature locations. The dataset was split, tagged, and converted to BERT’s input format, then used to fine-tune a bert-base-chinese model.
- Prepare prediction data: Collect a testing dataset of over 1,000 Cantopop lyrics by Albert Leung (林夕) via Python-based web scraping of Feitsui Lyric [https://www.feitsui.com], then clean the data to remove irrelevant content so that only the actual lyrics remain.
- Fine-tune the NER model and apply it: Fine-tune the model with the training data, then segment the testing data and process it line by line.
- Visualization: Count the recognized entities and visualize them based on a preliminary analysis, using a bar chart, a word cloud, and a map.
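The tagging and counting steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code. It assumes character-level BIO tags with a single LOC label, which is a common input format for Chinese BERT NER but is not confirmed by this page. The helper names `tag_line`, `decode_entities`, and `count_locations` are hypothetical, and the sample lines are invented for the example and are not real lyrics. In the real pipeline, the tags for the testing data would come from the fine-tuned model rather than from a lookup list.

```python
from collections import Counter


def tag_line(line, locations):
    """Label each character of one lyric line with B-LOC, I-LOC, or O.

    `locations` is a hand-made list of place names (an assumption used
    here to stand in for manual annotation of the training songs).
    """
    tags = ["O"] * len(line)
    for loc in locations:
        if not loc:
            continue  # an empty string would match everywhere
        start = line.find(loc)
        while start != -1:
            end = start + len(loc)
            # Tag the span only if no other location already covers it.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B-LOC"
                for i in range(start + 1, end):
                    tags[i] = "I-LOC"
            start = line.find(loc, end)
    return list(zip(line, tags))


def decode_entities(chars, tags):
    """Merge character-level BIO tags back into location strings."""
    entities, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B-LOC":
            if current:
                entities.append(current)
            current = ch
        elif tag == "I-LOC" and current:
            current += ch
        else:
            if current:
                entities.append(current)
            current = ""
    if current:
        entities.append(current)
    return entities


def count_locations(lines, locations):
    """Process lyrics line by line and count every decoded location."""
    counts = Counter()
    for line in lines:
        pairs = tag_line(line, locations)  # stand-in for model prediction
        chars = [ch for ch, _ in pairs]
        tags = [tag for _, tag in pairs]
        counts.update(decode_entities(chars, tags))
    return counts


if __name__ == "__main__":
    # Invented sample lines, not actual lyrics.
    sample_lines = ["我在東京等你", "回到香港的街", "東京的雪"]
    counts = count_locations(sample_lines, ["東京", "香港"])
    print(counts.most_common())  # [('東京', 2), ('香港', 1)]
```

The resulting `Counter` is the kind of frequency table that a bar chart or word cloud could be drawn from.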
Figure 1. Word Cloud
Key Features
- Robust Web Scraping: A scraper designed to fetch lyrics from Feitsui Lyrics, featuring automatic retry logic and content cleaning (removing Pinyin/English).
- Named Entity Recognition (NER): Utilization of BERT-based models to extract location entities from unstructured text.
Figure 2. Place Name Frequency
- Data Visualization:
  - Statistical Analysis: Bar charts illustrating the frequency of top locations.
  - Word Clouds: Visualizing the “geographic atmosphere” of the lyrics.
- Interactive GIS Mapping:
  - Integration with Google Earth for precise manual geocoding.
  - Generation of an Interactive Web Map using Folium.
  - Color-coded Clustering: Red markers for Japan (Travel) vs. Orange markers for Hong Kong (Home), complete with lyric snippets in pop-ups.
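The retry and cleaning behaviour described under Robust Web Scraping might look roughly like the sketch below. It is written under stated assumptions and is not the project's scraper. The names `fetch_with_retry` and `clean_lyrics` are hypothetical, and the `fetch` argument is a stand-in for whatever HTTP call the scraper uses. The sketch also assumes that romanization appears as Latin letters with an optional tone digit, and that any line without Chinese characters is not part of the lyrics. The sample text is invented.

```python
import re
import time

# Basic CJK Unified Ideographs block; enough for this illustration.
CJK = re.compile(r"[\u4e00-\u9fff]")
# A run of Latin letters with an optional trailing tone digit, e.g. "ngo5".
LATIN_RUN = re.compile(r"[A-Za-z]+[0-9]?")


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on any exception with a growing pause."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))
    raise RuntimeError(f"failed after {retries} attempts: {url}") from last_error


def clean_lyrics(raw):
    """Keep only lines containing Chinese text, stripped of Latin runs."""
    cleaned = []
    for line in raw.splitlines():
        if not CJK.search(line):
            continue  # romanization-only, English-only, or blank line
        line = LATIN_RUN.sub("", line)
        line = re.sub(r"\s+", " ", line).strip()
        cleaned.append(line)
    return cleaned


if __name__ == "__main__":
    # Invented sample text, not actual lyrics.
    raw = (
        "我在東京等你\n"
        "ngo5 zoi6 dung1 ging1 dang2 nei5\n"
        "I wait for you in Tokyo\n"
        "回到香港 Hong Kong\n"
    )
    print(clean_lyrics(raw))  # ['我在東京等你', '回到香港']
```

Passing the fetch function in as an argument keeps the retry logic independent of any particular HTTP library, so it can be exercised offline with a fake function.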
Tech Stack
- Language: Python 3.10+
- Data Acquisition: requests, BeautifulSoup4
- Data Processing: pandas, re, xml
- NLP: transformers (Hugging Face), jieba (optional)
- Visualization: matplotlib, seaborn, wordcloud, folium
ZHENG Xin
MA Chinese Culture
CHEN Zihui
MA Chinese Culture
YU Liangliang
MA Chinese Culture
LU Zisheng
MA Chinese Culture
GitHub Repository
Please find the detailed code and project documentation at the link below.
