Workshop: Week 1

PRELIM 1. Subversion setup.
===========================

We will be working with the Subversion version control system. Please install it in whatever environment you are working in. (If you can't install it, for instance because you don't have admin rights, please let me know. You may then skip to PRELIM 1A.) If you are not familiar with Subversion, please read a tutorial on its use.

I have created a shared Subversion repository for the subject, with a directory for each student. Access to the repository is password controlled, and authorization controls will prevent you from accessing another student's directory. Login is with your unimelb username. You should have already received your password; if not, let me know.

If your username is USER, then the URL for the working directory in your Subversion repository is:

    http://svn.codalism.com/svn/comp90042-2014s1/USER

Please check this working directory out onto your working machine. At the command line, that will look something like:

    svn co --username USER --password PASSWORD \
        http://svn.codalism.com/svn/comp90042-2014s1/USER

and you will end up with a subdirectory in your current directory called USER. I'll refer to this as the "working directory".

Now enter your working directory, and create a subdirectory called "workshops". Add this subdirectory and commit it to Subversion:

    cd USER
    mkdir workshops
    svn add workshops
    svn commit

Then create, add, and commit a subdirectory for this week:

    cd workshops
    mkdir week1
    svn add week1
    svn commit

For each question in this week's workshop, create a separate subdirectory, as:

    week1/q3
    week1/q4
    week1/q5

and so forth. Follow the same procedure for each week's workshops.

When you finish the workshop, please add and commit your files to Subversion. Please do _not_ add data files, as this will take up too much space in the repository.

PRELIM 1A. Subversion setup failed.
===================================

If you are unable to set up Subversion, then please do the following exercises on your local machine. As described above, create a subdirectory for each question, and copy files between these subdirectories as necessary. When you have finished the exercises, please zip up your answers and email the zip file to me at william@williamwebber.com. Please do _NOT_ include data files in the zip file.

PRELIM 2. Data
==============

We'll be working with a subset of just over 30,000 documents from the RCV1v2 document collection mentioned in the lecture. More specifically, we'll be working with a subset of the pre-tokenized version of it (for convenience, and for copyright reasons). Our dataset can be downloaded from:

    http://www.williamwebber.com/research/teaching/comp90042/2014s1/wksh/wk1/dat/lyrl_tokens_30k.dat.gz

(The original tokenized dataset is available from http://jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm)

When you download this data, place it in a directory _outside_ the Subversion working directory. That way, you can be confident that it will _not_ get added to the Subversion repository. (If it does, the repository will grow rather too large.) After downloading it, uncompress it, and have a look at the contents, to get a feel for how the data is represented.

I'll refer to this as the LYRL dataset. (LYRL are the initials of the researchers who put it together, led by Dave Lewis.)

PRELIM 3. Code
==============

I have written some wrapper code to parse the LYRL dataset and provide a simple API for accessing it. This code can be downloaded from:

    http://www.williamwebber.com/research/teaching/comp90042/2014s1/wksh/wk1/code/coll.py

Read through the code and check that you understand what it does. I've commented it extensively, with both documentation comments (in """ """) and explanatory comments to help you get started with Python. Run this code and check that it and your Python setup work.
You will reuse this code in many of the following questions. (You're free to write your own version if you really want to, but there's no need.) Each time you reuse it, please copy it afresh into the subdirectory for that question (and modify it there, if you need to). Yes, this is not good code reuse, but it's the cleanest way of approaching these questions.

===============================================================

Question 0.
===========

(This is only for Python practice. If you're comfortable with Python, skip to Question 1---though you may still want to do this one to sanity-check your environment.)

Write a program that loads an LYRL collection, and prints out the document ids and the number of terms in each document. Do this by writing a "main" section marked:

    if __name__ == '__main__':

and a separate function called:

    def get_docinfo(coll):

that takes a BowColl as its argument, and returns a dictionary in which the keys are the docids, and the values are the document lengths. Load the collection in the "main" section; pass it to this function; then print out the document ids and lengths in the "main" section. Try to print out the document ids in alphabetical order.

Question 1: Vocabulary growth
=============================

We talked in the lecture about growth in vocabulary size as documents are added. Calculate this growth for our LYRL dataset. To do this, iterate over the documents in the collection, and then over the terms in each document. Record the cumulative vocabulary you see in a dictionary (you can just set the values to True, or you can use a Python set() object if you prefer). After each document, calculate the _increase_ in vocabulary size for that document. At what rate are new terms being added at the end of the collection (say, the average number of new terms per document over the last 1,000 documents)?
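As a sketch of the iteration involved (this uses plain lists of terms rather than the coll.py API, which your actual answer should use):

```python
def vocab_growth(docs):
    """Given an iterable of documents (each a list of terms), return a
    list giving the number of *new* terms contributed by each document."""
    seen = set()  # cumulative vocabulary so far
    increases = []
    for terms in docs:
        before = len(seen)
        seen.update(terms)  # add any previously unseen terms
        increases.append(len(seen) - before)
    return increases

# Toy example: three tiny "documents".
docs = [["a", "b", "b"], ["b", "c"], ["a", "c", "d"]]
print(vocab_growth(docs))  # [2, 1, 1]
```

The running total of `increases` gives the cumulative vocabulary size after each document.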
Graph the _cumulative_ number of terms per document over time (use whatever graphing program you like; Excel is ok, or you can use the Python graph library). I suggest that you implement this by writing a function that takes a BowDoc and a dictionary, and adds new terms from the BowDoc to the dictionary.

Question 2: DF calculation
==========================

In a separate file, write a function that takes a BowColl object (from coll.py) as its argument, and calculates the vocabulary of the collection, and the number of documents each term in that vocabulary occurs in. (This is known as the _document frequency_ of the term, and is required for calculating the IDF normalization factor.) The function should return a Python dictionary, in which the keys are the terms, and the values are the document frequencies.

Write a wrapper main section that takes the dataset file as its sole argument, and prints out the document frequency information, as:

    TERM DOC_FREQ

one line per term.

- i.) What are the ten most frequent terms in the collection?
- ii.) Find ten terms that only occur once in the collection.

(Place the answers to i. and ii. in a file called "answer.txt".)

Question 3: TF*IDF calculation
==============================

Write a function that takes a BowDoc and a dictionary containing DF values for the terms in the collection (you'll want to reuse your code from Question 2 for this), and returns the TF*IDF scores for the document, as a dictionary. Use the following formula for TF*IDF:

    log(TF + 1) * log(N / DF)

** AMENDMENT 05/03: The function will also need to take the number of documents in the collection, N, as an argument.

Write wrapper code that takes a document ID as an argument, and prints out the TF*IDF scores for that document.

Question 4: Length normalization
================================

As mentioned in the lecture, the term vector for a document should have "unit length".
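By way of illustration, here is a sketch of the arithmetic for Questions 3 and 4 (and the cosine computation that comes up in Question 5), using plain {term: count} dictionaries rather than the BowDoc objects from coll.py. The function names are my own, not part of the supplied code:

```python
import math

def tfidf(doc_tf, df, n_docs):
    """TF*IDF scores for one document, using log(TF + 1) * log(N / DF).

    doc_tf: {term: raw count} for the document.
    df:     {term: document frequency} over the collection.
    """
    return {t: math.log(tf + 1) * math.log(n_docs / df[t])
            for t, tf in doc_tf.items()}

def unit_normalize(scores):
    """Scale a score vector so its Euclidean length is 1.0."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {t: v / length for t, v in scores.items()}

def cosine(scores_a, scores_b):
    """Cosine similarity: dot product of two unit-length score vectors."""
    return sum(v * scores_b.get(t, 0.0) for t, v in scores_a.items())

# Toy collection of two documents, as {term: count} dictionaries.
doc_a = {"komen": 3, "record": 1}
doc_b = {"komen": 1, "riots": 2}
df = {"komen": 2, "record": 1, "riots": 1}

va = unit_normalize(tfidf(doc_a, df, 2))
vb = unit_normalize(tfidf(doc_b, df, 2))
print(cosine(va, vb))  # 0.0: the only shared term occurs in every document, so its IDF is zero
```

Note that once vectors are unit-normalized, cosine similarity reduces to a plain dot product over the terms the two documents share.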
Modify the function in Question 3 (again, make sure you copy it all into its own subdirectory) so that it gives length-normalized TF*IDF scores.

NOTE ON PROGRESS:
=================

If you're new to Python programming, you're probably exhausted by now. Just think about how the next two questions could be answered, and the operations that would be involved in them. Be sure to check the answers I provide next week. You can still answer Questions 7 and 8.

Question 5: Similarity computation
==================================

Now modify (again, copy first) the program in Question 4 so that it takes two document ids as its arguments, and calculates the cosine similarity between these two documents. (You probably want to do this calculation as a separate function.)

Use this function to calculate the similarity between the following document pairs:

    26152 26159
    26152 26413

Which pair is more similar? Does this make sense given the contents of the documents themselves? (See the APPENDIX to this worksheet for the text of these documents.)

Question 6: Find similar computation
====================================

OK, so now generalize your code in Question 5 so that it takes a single document ID, and finds the other document in the collection that is most similar to that document (make sure you don't match the document against itself!). Use this code to find the document that is most similar to Document 26413. (If you want the original text for the document you find, then let me know, and I'll email it to you.)

THINKING POINT: note that the same code (with some minor changes) could be used to rank the other documents in the collection by their decreasing similarity to a specified document.

Question 7: Compact representation
==================================

In the lectures, we have been talking about the Term--Document Matrix, or TDM.
We said that, even for this small collection, the full TDM would be hundreds of millions of cells in size (30,000 documents, around 15,000 terms in the vocabulary). However, you have been able to perform similarity computations without "instantiating" the full matrix. How? What, in effect, has your representation of the TDM been? Currently, you calculate everything on the fly; but if you wanted to create an in-memory index of the collection for the above programs, how might you do it? How big would this index be (roughly)?

Question 8: Term similarity in document space
=============================================

We also talked in the lecture about comparing not just document similarity in term space, but term similarity in document space. The above representation would not be very useful for performing this computation (for instance, for saying how similar the terms 'soccer' and 'cricket' are); why not? If we had the full TDM instantiated, how might we calculate term similarity? How might you construct a more compact representation just for calculating term similarity?

=================================================================
= APPENDIX
=================================================================

========================
= TEXT OF DOCUMENT 26152
========================

ITALY: ATHLETICS-THIRD TIME LUCKY FOR KOMEN AS HE BREAKS WORLD RECORD.

Paul Hanna

RIETI, Italy 1996-09-01

Daniel Komen of Kenya made it third time lucky on Sunday when he shattered Noureddine Morceli's 3,000 metres world record by nearly five seconds at an international meeting.

The 20-year-old Kenyan, who failed to qualify for the Atlanta Olympics, clocked seven minutes 20.67 seconds, breaking Morceli's time of 7:25.11 set two years ago in Monte Carlo.

The Kenyan failed to qualify for the Olympics but has been on blistering form on the grand prix circuit since and was only 0.05 of a second outside Morceli's mark in Monaco last month.

He then clocked 7:25.87 at the Brussels grand prix on August 23, the third fastest time in history, with Morceli finishing well back in sixth place.

Kenyan David Kisang led the field through 1,000 metres in 2:25.89 before Komen took the lead, clocking 4:53.18 at the 2,000 mark and carrying on to finish more than 20 seconds ahead of his nearest rival.

Kenyas's Shem Kororia, was second in 7:43.17 with Italian Gennaro di Napoli third in 7:46.39.

Morceli, the Olympic and world 1,500 metres champion, said he would be back.

"Komen is young and very good and deserves today's result," Morceli told reporters. "I'm sure he has the means to do other things, but I've still got something to say."

Morceli, who holds the world 1,500 metre and mile records, comfortably won the 1,500 in 3:29.99 ahead of Burundi's Olympic 5,000 metres champion Venuste Niyongabo.

Kenyan-born Wilson Kipketer, who runs for Denmark, set the third fastest time in history in the 800 metres clocking one minute 41.83 seconds.

(c) Reuters Limited 1996
========================
= TEXT OF DOCUMENT 26159
========================

ITALY: ATHLETICS-KOMEN FINALLY BREAKS MORCELI RECORD.

RIETI, Italy 1996-09-01

Daniel Komen, who has twice been within a fraction of a second of Noureddine Morceli's world 3,000 metres record, shattered the mark by nearly five seconds on Sunday.

The 20-year-old Kenyan clocked seven minutes 20.67 seconds for the non-championship event at an international meeting to break the Algerian's two-year-old mark of 7:25.11.

Kenyan David Kisang took the field through 1,000 metres in 2:25.89 before Komen took the lead, clocking 4:53.18 at the 2,000 mark and carrying on to finish more than 20 seconds ahead of his nearest rival.

Kenyas's Shem Kororia, was second in 7:43.17 with Italian Gennaro di Napoli third in 7:46.39.

Komen, who did not make the Kenyan team for the Atlanta Olympics, was only 0.05 of a second outside Morceli's mark in Monaco last month.

He clocked 7:25.87 at the Brussels grand prix on August 23, the third fastest time in history, with Morceli finishing well back in sixth place.

In the previous week Komen came close to breaking Ethiopian Haile Gebreselassie's world 5,000 metres record, clocking the second fastest time in history of 12:45.09 at the Zurich grand prix.

(c) Reuters Limited 1996
========================
= TEXT OF DOCUMENT 26413
========================

INDONESIA: Indonesian rights body report on riots leaves gaps.

Jim Della-Giacoma

JAKARTA 1996-09-01

A report by Indonesia's official human rights body saying five people died in riots in Jakarta in July and 74 were missing left many questions unanswered, human rights activists said on Sunday.

The unrest, the worst in the capital in more than two decades, broke out after police raided the headquarters of the Indonesian Democratic Party (PDI) which was occupied by supporters of ousted party leader Megawati Sukarnoputri, daughter of Indonesia's late founding president Sukarno.

The two-page preliminary report of the National Commission for Human Rights, released on Saturday night, added one to the death toll reported by the Indonesian military and said 149 were injured, including security force members, in the riots on July 27.

"There is actually nothing new except that the commission added one more confirmed death. Things are still very unclear," said Kwik Gian Gie, a PDI member.

"The commission's report is more or less the same as PDI data with so many missing, (but) it is impossible up until now to confirm if those missing are dead or not," Kwik told Reuters.

The commission named the dead men as Uju bin Asep, 31, Asmayadi Soleh, 19, Suganda Siagian, 21, Slamet, 52, and Sariwan, 40, but did not give details of how they died.

"We did not focus our investigation on when, how and why they died, went missing or were injured. We mainly relied on documented data from the hospitals," commission member Soegiri said.

Commission secretary-general Bahruddin Lopa told a news conference that those missing could not be assumed dead as they could be hiding out of fear, resting or just staying out of public view.

Lopa said information on those missing was incomplete and contradictory. "This information needs to be checked and checked again to see if they are indeed lost," he added.

"We hope this brief statement will lessen the atmosphere of uncertainty and speculation," commission deputy chariman Marzuki Darusman said, adding the commission would provide a full report on the riots in the near future.

Sidney Jones, executive director of the New York-based Human Rights Watch/Asia, told Reuters she was surprised that the commission went public with such a high number of missing.

"I think it was very courageous of the commission to come out with that figure because that it still makes the figure of only five dead questionable," Jones said.

"With 74 people that even the commission regards as having disappeared, the potential for many more deaths is still very much alive," she said.

Human Rights Watch/Asia and the Washington-based Robert F. Kennedy Memorial Centre for Human Rights condemned the Indonesian government for its crackdown after the riots.

President Suharto blamed the riots on the small left-wing People's Democratic Party (PRD) whose leaders were rounded up and branded Communists. They face subversion charges which carry a maximum penalty of death.

The PRD members are part of the more than 250 people detained after the riots. About 130 are still in detention awaiting trial.

The PDI's Megawati was replaced as party leader in June by deputy parliamentary speaker Surjadi at a government-backed congress in Medan, North Sumatra.

(c) Reuters Limited 1996