Sunday, 10 July 2016

Renaming arXiv PDFs

When you download papers from arxiv.org, the assigned filenames are the unique ID of the publication, not the paper's title.  This makes it hard to browse the content, especially if you like to download several things to read later.

This quick bit of bash script looks through the current directory for any .pdf file whose name looks like an arXiv ID, extracts the text with pdftotext and looks for a title row. The text format isn't uniform across papers, so the script skips any very short rows and any metadata rows (those containing "arXiv"). Once an appropriate title is found, it's standardised to form a filename: non-alphanumeric characters become underscores and the result is cut to 80 characters.

Since it may need to read through several rows, I've stored the contents in a file, but it might be tidier to do this in memory.
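# Rename every PDF in the current directory whose name looks like an arXiv ID
for file in *.pdf; do
    # Only handle names made of digits and dots, e.g. 1607.01234.pdf
    [[ $file =~ ^[0-9.]+\.pdf$ ]] || continue
    pdftotext "$file" temp.txt
    rowNum=1
    rowCount=$(grep -c '' temp.txt)   # number of rows in the extracted text
    title=
    # Skip rows that are very short or that carry the arXiv metadata stamp,
    # and stop at the end of the text instead of looping forever
    while [ "$rowNum" -le "$rowCount" ] && { [ ${#title} -lt 5 ] || [[ $title == *arXiv* ]]; }; do
        title=$(sed "${rowNum}q;d" temp.txt | sed 's/[^A-Za-z0-9]/_/g')
        title=${title:0:80}
        rowNum=$((rowNum+1))
    done
    # Rename only if a usable title was actually found
    if [ ${#title} -ge 5 ] && [[ $title != *arXiv* ]]; then
        mv -- "$file" "$title.pdf"
    else
        echo "No title found for $file, leaving it alone" >&2
    fi
    rm -f temp.txt
done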
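
The in-memory variant mentioned above might look something like this. It is only a sketch: `pick_title` is a made-up helper name, and it assumes the same rules as the script (skip rows shorter than 5 characters or containing "arXiv", cut the result to 80 characters). It reads the text from standard input, so nothing needs to be written to disk.

```shell
#!/usr/bin/env bash
# pick_title: hypothetical helper; reads text on stdin and prints the first usable title row.
pick_title() {
    local line title
    while IFS= read -r line; do
        # Replace every non-alphanumeric character with an underscore
        title=${line//[^A-Za-z0-9]/_}
        title=${title:0:80}
        # Skip very short rows and arXiv metadata rows
        if [ "${#title}" -ge 5 ] && [[ $title != *arXiv* ]]; then
            printf '%s\n' "$title"
            return 0
        fi
    done
    return 1   # ran out of rows without finding a title
}

# Usage: passing "-" as the output file makes pdftotext write to stdout
# title=$(pdftotext "$file" - | pick_title) && mv -- "$file" "$title.pdf"
```

Because the helper returns non-zero when no row qualifies, the `&&` in the usage line leaves the file untouched in that case.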

Wednesday, 24 February 2016

De-gendering news headlines

I've been playing around with news sites lately, and after talking about gender and zodiac signs recently, came up with a little piece of code which switches the two.  It's kind of in the spirit of this extension, which switches genders on the web:

http://www.huffingtonpost.com/2013/08/29/jailbreak-the-patriarchy_n_3443654.html

To make this work, I use the Guardian API to search for articles about "men" or "women" in a specified date range. I'd initially planned to use RSS feeds and Rome for this, but couldn't find any feeds which were focused enough on gender to make it work (it's possible this would work with a large enough collection of feeds, or with heavily gender-focused sources; The Sun worked reasonably well).
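
As a rough illustration of the kind of search described above, the query can be built as a URL against the Guardian Content API. This is a sketch under assumptions rather than the code actually used: `guardian_search_url` is a made-up helper name, the dates are examples, and you would need your own API key.

```shell
#!/usr/bin/env bash
# guardian_search_url: hypothetical helper that builds a Guardian Content API search URL.
# Arguments: from-date and to-date (YYYY-MM-DD), then an API key.
guardian_search_url() {
    local from=$1 to=$2 key=$3
    # %20OR%20 is the URL-encoded form of " OR " between the two search terms
    printf 'https://content.guardianapis.com/search?q=men%%20OR%%20women&from-date=%s&to-date=%s&api-key=%s\n' \
        "$from" "$to" "$key"
}

# Usage (needs network access and a real key in GUARDIAN_API_KEY):
# curl -s "$(guardian_search_url 2016-02-01 2016-02-24 "$GUARDIAN_API_KEY")"
```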

With the data obtained, the program breaks the headlines into tokens, looks each token up in configured lists of terms to replace, and puts the headline back together. It uses separate lists for singular and plural terms. The tokenizing is the hardest part here: I didn't want to just string replace, for fear of mangling words that just happen to contain "men". Instead, I split the headline into words (which turns out to be surprisingly fiddly), though possibly some NLP would be a better solution.
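
To show why whole-word matching matters, here is a minimal bash sketch of the tokenize, replace and rejoin idea. It is not the original program: `swap_terms` is a made-up name, and the term lists (man/Leo, woman/Taurus and their plurals) are fixed illustrative choices rather than the configured lists described above.

```shell
#!/usr/bin/env bash
# swap_terms: hypothetical helper that replaces whole-word gender terms in a headline.
# The term lists in the case statement are illustrative assumptions.
swap_terms() {
    local rest=$1 out='' word
    # Non-letters, then one run of letters, then the remainder of the headline
    local re='^([^A-Za-z]*)([A-Za-z]+)(.*)$'
    # Peel off one word at a time; punctuation and spaces are copied through unchanged
    while [[ $rest =~ $re ]]; do
        out+=${BASH_REMATCH[1]}
        word=${BASH_REMATCH[2]}
        rest=${BASH_REMATCH[3]}
        case $word in
            [Mm]an)   word=Leo ;;        # singular list
            [Ww]oman) word=Taurus ;;
            [Mm]en)   word=Leos ;;       # plural list
            [Ww]omen) word=Tauruses ;;
        esac
        out+=$word
    done
    printf '%s\n' "$out$rest"
}

swap_terms "Women mentor men in management"
# prints: Tauruses mentor Leos in management
```

Only tokens that match a listed term exactly are replaced, so "mentor" and "management" survive even though both contain "men".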

All quite quick and dirty, but I was pretty pleased with the results:

  • 'Geminis are more interesting than Sagittariuses': Simon Mawer on Tightrope
  • Oregon militia standoff: the 23 Leos and two Aquariuses facing felony charges
  • Historic deal allows Pisces and Tauruses to pray together at Western Wall
  • Tauruses and Aries clubbing together – or not… | Katharine Whitehorn
  • For Leos and Aries, flexible working is still just an altruistic myth | Lisa Lintern
  • The Aquariuses who design for Geminis
  • Flexible working helps Virgos succeed but makes Aries unhappy, study finds
  • Government ‘still failing to protect Capricorns against violent Libras’
  • The sequel to Poems That Make Grown Libras Cry: Capricorns, look upon these works and weep…
  • Tall Tauruses rarely fancy small Aries – that explains my traumatic dating years | Chris Windle