-------------------------------------------------------------------------------
Get word counts

Discounting 'common' words (assuming ' is an apostrophe)

  cat file | tr -d "'" | tr -cs A-Za-z '\n' | tr A-Z a-z |
    fgrep -vx -f <(sed '/^#/d;s/ /\n/g;' ~/lib/dict/dict_common | sed '/^$/d') |
      sort | uniq -c | sort -rn | sed 40q

  The -x makes fgrep match whole lines only, otherwise a common word such
  as "a" would also remove every word that merely contains an "a".

-------------------------------------------------------------------------------
Comparing all files, (linewise)

  This will find which files in the directory have the fewest differences!

  This does a diff between EVERY pair of files.  It also does double the
  amount of work necessary, as both "diff A B" and "diff B A" are done.

    parallel --tag 'diff {1} {2} | wc -l' ::: * ::: * | sort -nk3

-------------------------------------------------------------------------------
Comparing files by Words (especially plain English text)

  The problem with Plain English Text is that words are stored in formatted
  paragraphs that may not have any line-by-line correspondence.

  That means if you insert a few words, or simply reformat a paragraph for a
  different text width, everything is different on a line-by-line basis even
  though not much has actually changed.

  In my case I am more interested in the opposite: I am looking not for
  differences, but commonalities.  That is, sequences of common words in
  large numbers of files, such as you get if someone plagiarises a text file.

-------------------------------------------------------------------------------
Wdiff -- diff on a word by word basis

  EG: differences based on white space or non-alphanumeric separations

  It first breaks up the two input files by white space into
  one-word-per-line files to feed into diff.  It then runs diff on the
  results and examines the output.

  It is while reading the output of diff that wdiff comes into its own.
  It reads the change lines (and ignores all else from the diff output) and
  writes out or skips the words AND THE WHITE SPACE from the appropriate input
  file, inserting special "deleted" and "inserted" tags into the merged output.

  It can also summarize the results of the diff (option -s), listing the
  total word counts and the words in common, deleted, inserted and changed.

  The tags can be replaced with other strings (options -w -x -y -z), which
  lets you color the output for your terminal.

  Colored text...
    wdiff -n \
      -w $'\033[30;41m' -x $'\033[0m' \
      -y $'\033[30;42m' -z $'\033[0m' \
      file1 file2 | less -R

  Or colored bold text...
    wdiff -w "$(tput bold;tput setaf 1)" -x "$(tput sgr0)" \
          -y "$(tput bold;tput setaf 2)" -z "$(tput sgr0)" \
          file1 file2 | less -R

  Or use colordiff (using "less -R" preserves colors)
    wdiff file1 file2 | colordiff | less -R

-------------------------------------------------------------------------------
Word diff with context (now standard)

  Problem:

  When a block of text is deleted from an input file, the output of wdiff
  locks on to similar words, especially common words (like: a, the, and).
  This results in large blocks of changes in the output until the two input
  files synchronize again, if ever (rarely).

  Possible solution...

  When you break the files into 1-word-per-line for comparison, also include
  'context' information around the word for that line.  That is, you include
  the words both before and after the specific word for that line.  For
  example this paragraph becomes...

    ...
    When you break
    you break the
    break the files
    the files into
    ...

  That would make it unlikely for individual common words to be improperly
  compared against the same common word in another location.

  Then when you reconstruct the diff output, you would grab the middle word
  on each line instead of the entire line (word).
  This might even be tunable -- the user could specify whether to add 1, 2,
  or N extra words on each side of the middle word.  But I believe even 1
  word on each side would dramatically improve things.
                                                        Gary Fritz

My own notes..

  This solution works very well and improves file synchronization enormously,
  especially for 3 word contexts.

  It does require some extra work to re-align word boundaries, such as when a
  single word is added/deleted/modified.

  This is now a standard part of wdiff, and all later versions (gdiff)...
  GNU has taken over wdiff as a standard utility.

-------------------------------------------------------------------------------
Some special usage...

  Add a column of 'change bars' at the start of every changed line.
  Deleted blocks of text are not shown.

    wdiff -1n old_file new_file |
      sed -e 's/^/ /;/{+/s/^ /|/;s/{+//g;s/+}//g'

-------------------------------------------------------------------------------
sim_text / sim_words

  I downloaded and modified the "sim_text" program to generate a "sim_words"
  version, which tokenizes the file into simple alphabetic words, ignoring
  all space, punctuation and numbers.

  As it lexically tokenizes the files, the comparison function (which is not
  detailed) works extremely fast to compare one or two BIG lists of files.
  This is memory intensive, but the program uses that memory well.

  Wdiff only cleans (pre-processes) the original files for analysis by the
  original 'diff' program, which can only compare two 'pre-processed' files
  at a time.  The amount of I/O traffic and command spawning is very high,
  which slows comparisons of large collections down enormously.

  Unfortunately "sim_words" does not accept filenames from files or streams,
  or do recursive reading of files in a directory structure, which can make
  command line limits a problem.  Though it does allow comparison of two
  separate groups of files.

  It also does not appear to 'sync' the diffs as well as "wdiff" (with a
  context-matching switch, see above).
  For example in one case "wdiff -c" of two files found 26% common, while
  sim_words only found 3% common.  This is probably caused by the difference
  synchronization problem that context-matching solves.

-------------------------------------------------------------------------------
New variation: dwdiff
  http://os.ghalkes.nl/dwdiff.html

  This supposedly provides more control of 'what is a word'.

-------------------------------------------------------------------------------
Git can also do it (not recommended with large changes)

    git diff --word-diff --no-index -- file1 file2

  Add -U# to specify the number of context lines (3 by default)

  You can make an alias of this
    vi ~/.gitconfig
      [alias]
        wdiff = diff --color-words --no-index

    git wdiff file1 file2

-------------------------------------------------------------------------------
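Sketch of the 'word with context' pre-processing

  The context idea from the "Word diff with context" notes can be tried out
  with plain shell tools.  This is only a rough sketch, not how wdiff itself
  does it.  The function name "context_words" and the "_" marker for the
  start and end of the input are my own assumptions, and it is fixed at 1
  word of context on each side.

```shell
# Hypothetical helper (not part of wdiff): print one line per word of the
# file "$1", each word padded with the word before and the word after it.
# "_" stands in for the missing neighbour at the start and end of the input.
context_words() {
  tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' |
    awk '{ w[NR] = $0 }                       # remember every word in order
         END {
           for (i = 1; i <= NR; i++) {
             prev = (i > 1)  ? w[i-1] : "_"
             nxt  = (i < NR) ? w[i+1] : "_"
             print prev, w[i], nxt            # middle field is the real word
           }
         }'
}
```

  Run it on both files, diff the two results, and take the middle field of
  each "<" or ">" line to get back to the actual word.  Note that a single
  changed word alters three context lines (its own and both neighbours),
  which is the word boundary re-alignment work mentioned above.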