TF-IDF: Term Frequency-Inverse Document Frequency
TF-IDF (Term Frequency-Inverse Document Frequency) is an algorithm for determining the most relevant terms in a document within a large collection of documents. One example is Google looking for important terms in a vast collection of websites. I attempted a more modest analysis: finding the most important term in each of the works of Shakespeare.
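For reference, a common formulation of the score (variants differ in the exact weighting and smoothing) is:

    tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times term t occurs in document d, df(t) is the number of documents that contain t, and N is the total number of documents in the collection. A term scores high when it is frequent within one document but rare across the collection as a whole, which matches the intuitive notion of an "important" term.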
I wrote a collection of Java MapReduce programs to run under Hadoop. My computer setup was the following:
- Ubuntu 14.04 64-bit (Windows and Mac will work equally well as hosts, but you need more than 4 GB of RAM on any system)
- VirtualBox 4.3.20
- Cloudera QuickStart VM (CDH 5.3.0)
- Hadoop 2.5.0
- Eclipse Juno
- Java 1.7
The whole program consists of a MapReduce driver and five Mapper/Reducer steps (a minimal sketch of a representative step follows the list):
- TF-IDF MR Driver
- TF-IDF MR Step 1
- TF-IDF MR Step 2
- TF-IDF MR Step 3
- TF-IDF MR Step 4
- TF-IDF MR Step 5
- Most-Significant Terms in the Works of Shakespeare
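To give a flavor of what one of these steps looks like, here is a minimal, self-contained sketch in the spirit of the first step. It is not my actual code: the class names, the term@document key format, and the use of the input file name as the document ID are all illustrative assumptions. It simply counts how many times each term occurs in each document.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermCountPerDocument {

    // Emits ("term@document", 1) for every word in the input line.
    public static class TermMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the input file name as the document identifier.
            String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    outKey.set(token + "@" + doc);
                    context.write(outKey, ONE);
                }
            }
        }
    }

    // Sums the counts for each ("term@document") key.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "term count per document");
        job.setJarByClass(TermCountPerDocument.class);
        job.setMapperClass(TermMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run against a directory with one file per play, this produces one line per (term, document) pair with its raw count, which is the term-frequency half of the calculation; the remaining steps aggregate document frequencies and combine the two.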
NOTES: This algorithm may run into problems at scale, either because of memory usage or because of the startup/teardown cost of its numerous steps.
- Memory is likely to be the larger problem.
- In the first step, you can implement stemming to reduce the number of distinct terms, and you can add certain high-frequency terms to the stop-word list (see the sketch after these notes).
- The memory problems will be more severe in the reducer steps, and implementing combiners may help by offloading some of the memory usage to the mapper tasks, which are likely to be more numerous than the reducer tasks.
- Instead of using in-memory HashMaps, you can implement temporary disk storage.
- The second and fourth steps can fairly easily be folded into the preceding steps. There are two penalties for this: (1) the code is more complex and harder to understand, explain, and debug, and (2) the individual tasks will require more resources.
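To illustrate the stemming/stop-word point, here is a hypothetical variant of the step-1 mapper sketched above that drops stop words before emitting anything; stemming could be hooked in at the same spot. The hard-coded word list is a stand-in for a real stop-word list, which could instead be shipped to the tasks via the distributed cache.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Variant of the step-1 mapper that skips stop words before emitting counts.
public class StopWordTermMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();
    private Set<String> stopWords;

    @Override
    protected void setup(Context context) {
        // Illustrative list only; a real stop-word list would be much larger.
        stopWords = new HashSet<>(Arrays.asList(
                "the", "and", "of", "to", "a", "in", "that", "is"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String token : value.toString().toLowerCase().split("\\W+")) {
            // Drop empty tokens and stop words; a stemmer could be applied here too.
            if (!token.isEmpty() && !stopWords.contains(token)) {
                outKey.set(token + "@" + doc);
                context.write(outKey, ONE);
            }
        }
    }
}
```

As for the combiner suggestion: since the per-document counts are plain sums, the driver can register the same summing reducer as a combiner with job.setCombinerClass(SumReducer.class). Partial sums are then computed on the map side, which shrinks both the shuffle and the amount of data the reducers must hold.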
My thanks to the following for the help I derived from their websites:
My thanks also to Jure Leskovec and Daniel Templeton for teaching Stanford's CS246 and CS246H.