GitHub - colinmorris/lalala: Investigating repetitiveness in pop music using compression algorithms

Investigating repetition in pop music. Interested in questions like:

Has pop music been getting more (or less) repetitive over time?
Which songs/artists/genres are the most/least repetitive?

I'm measuring repetitiveness of a song by how well gzip can compress it. (Which sounds cheeky, but I think it can actually be justified when you look at how Lempel-Ziv compression works.)

The investigations from this repo were a precursor to a visual essay I did for Pudding.cool: Are Pop Lyrics Getting More Repetitive?. The code for that essay lives at https://github.com/polygraph-cool/song-repetition

Brief overview of calculating lyric compressibility

Put lyrics in text files (making sure they're ASCII encoded)
Compress those text files using gzip (I used the -9 flag to maximize the compression efficiency)
Run infgen on each gzip file, redirecting the output to a file. (See badromance_infgen.txt for an example of what one of these files looks like)
Run parse_ratio from parse_infgen.py on those files. This returns a tuple of the original (uncompressed) and compressed sizes (in bytes/characters). The ratio of those two numbers will give you the compression ratio.

Roughly speaking, parse_ratio calculates the compressed size using only the Lempel-Ziv part of the DEFLATE compression performed by gzip (and not the Huffman coding part). Infgen is what lets us separate those steps. The compressed size is calculated by treating a 'match' (i.e. a pointer backwards to an earlier portion of the text which is repeated) as costing 3 bytes. This is close to reality, but also just gives intuitively reasonable results for my purposes. You can increase the cost (to put more emphasis on longer repeated sequences, and avoid spurious repetitions on short character sequences) or decrease it - it shouldn't have a huge effect.

Lazy version

Run steps 1 and 2 above, then just look at the ratio between the file sizes of the original (text) files and the gzip files. The disadvantage is that this also incorporates the Huffman coding step (which is not relevant to the natural sense of 'repetitiveness' of song lyrics), and adds a constant amount of overhead (from the gzip header, and the huffman table), which can distort the rankings for very short texts. But overall, the rankings you get with this method will be pretty close to the more principled one above.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
prose		prose
.gitignore		.gitignore
Lyrics.py		Lyrics.py
README.md		README.md
add_genre.py		add_genre.py
badromance_infgen.txt		badromance_infgen.txt
billboard_scrape.py		billboard_scrape.py
common.py		common.py
filter_square_brackets.sh		filter_square_brackets.sh
filter_whisps.sh		filter_whisps.sh
god_frame.py		god_frame.py
lyrics_scrape.py		lyrics_scrape.py
normalizer.py		normalizer.py
notebook_helpers.py		notebook_helpers.py
parse_infgen.py		parse_infgen.py
pop_prose_splits.py		pop_prose_splits.py
retry_lyrics_scrape.py		retry_lyrics_scrape.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Brief overview of calculating lyric compressibility

Lazy version

About

Releases

Packages

Languages

colinmorris/lalala

Folders and files

Latest commit

History

Repository files navigation

Brief overview of calculating lyric compressibility

Lazy version

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages