Friday, December 21, 2012

Basic word frequency analysis

Here are the most frequent interesting terms from a 2008 Democratic presidential debate. A sample of the transcript:

"I believe we're at a defining moment in our history. Our nation is at war; our planet is in peril...."


Total Count of most terms : 9125
Interesting Word Freq Count: 1952
id=1 ct=112(39.16%) term=think
id=2 ct=101(35.31%) term=applause
id=3 ct=97(33.92%) term=clinton
id=4 ct=97(33.92%) term=people
id=5 ct=85(29.72%) term=senator
id=6 ct=66(23.08%) term=health
id=7 ct=62(21.68%) term=obama
id=8 ct=56(19.58%) term=care
id=9 ct=56(19.58%) term=blitzer
id=10 ct=47(16.43%) term=right
id=11 ct=44(15.38%) term=president
id=12 ct=40(13.99%) term=country
id=13 ct=35(12.24%) term=make
id=14 ct=34(11.89%) term=plan
id=15 ct=32(11.19%) term=question
id=16 ct=30(10.49%) term=believe
id=17 ct=30(10.49%) term=important
id=18 ct=28(9.79%) term=issue
id=19 ct=28(9.79%) term=take
id=20 ct=27(9.44%) term=time
id=21 ct=26(9.09%) term=years
id=22 ct=26(9.09%) term=american
id=23 ct=25(8.74%) term=first
id=24 ct=24(8.39%) term=insurance
id=25 ct=23(8.04%) term=bush
id=26 ct=23(8.04%) term=part
id=27 ct=21(7.34%) term=iraq
id=28 ct=20(6.99%) term=year
id=29 ct=20(6.99%) term=million
id=30 ct=19(6.64%) term=need
id=31 ct=19(6.64%) term=united
id=32 ct=19(6.64%) term=states
id=33 ct=18(6.29%) term=over
id=34 ct=18(6.29%) term=able
id=35 ct=17(5.94%) term=change
id=36 ct=17(5.94%) term=immigration
id=37 ct=17(5.94%) term=trying
id=38 ct=17(5.94%) term=work
id=39 ct=17(5.94%) term=clear
id=40 ct=17(5.94%) term=loo
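
A listing in this format could be produced by something like the Python sketch below. The post does not show its code, so everything here is an assumption: the function names `interesting_terms` and `format_report`, the small `STOP_WORDS` set, the four-letter minimum (suggested only by the fact that nearly every listed term has four or more letters), and the reading of the two header lines as "all tokens" and "tokens that survive the filter". The percentage column is left out because the post does not say what it is measured against.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the filter actually used for this post is not shown.
STOP_WORDS = {
    "about", "that", "this", "with", "have", "from", "they", "will",
    "what", "when", "were", "been", "because", "going", "know", "want",
    "just", "them", "then", "than", "there", "their", "would",
}


def interesting_terms(text, min_len=4, top_n=40):
    """Return (total tokens, interesting tokens, top_n (term, count) pairs)."""
    # Drop apostrophes before tokenizing, so "can't" becomes "cant".
    cleaned = text.lower().replace("'", "").replace("\u2019", "")
    words = re.findall(r"[a-z]+", cleaned)
    # Assumed filter: keep words of at least min_len letters that are not stop words.
    kept = [w for w in words if len(w) >= min_len and w not in STOP_WORDS]
    return len(words), len(kept), Counter(kept).most_common(top_n)


def format_report(total, interesting, ranked):
    """Format the counts in a layout similar to the listings in this post."""
    lines = [
        f"Total Count of most terms : {total}",
        f"Interesting Word Freq Count: {interesting}",
    ]
    for rank, (term, count) in enumerate(ranked, start=1):
        lines.append(f"id={rank} ct={count} term={term}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "We think people think about health care and people care."
    print(format_report(*interesting_terms(sample)))
```

Running the same function over each transcript, with the same stop-word list, keeps the listings comparable with one another.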

Contrast this word-frequency data with an Obama-Romney debate from 2012:

Total Count of most terms : 10361
Interesting Word Freq Count: 1853
id=1 ct=148(40.33%) term=romney
id=2 ct=109(29.70%) term=people
id=3 ct=106(28.88%) term=governor
id=4 ct=102(27.79%) term=president
id=5 ct=101(27.52%) term=make
id=6 ct=89(24.25%) term=obama
id=7 ct=86(23.43%) term=crowley
id=8 ct=72(19.62%) term=jobs
id=9 ct=71(19.35%) term=question
id=10 ct=66(17.98%) term=years
id=11 ct=44(11.99%) term=four
id=12 ct=43(11.72%) term=think
id=13 ct=41(11.17%) term=percent
id=14 ct=40(10.90%) term=country
id=15 ct=40(10.90%) term=energy
id=16 ct=40(10.90%) term=last
id=17 ct=35(9.54%) term=economy
id=18 ct=34(9.26%) term=down
id=19 ct=31(8.45%) term=right
id=20 ct=31(8.45%) term=america
id=21 ct=30(8.17%) term=back
id=22 ct=28(7.63%) term=women
id=23 ct=27(7.36%) term=time
id=24 ct=26(7.08%) term=need
id=25 ct=26(7.08%) term=believe
id=26 ct=26(7.08%) term=able
id=27 ct=26(7.08%) term=good
id=28 ct=26(7.08%) term=million
id=29 ct=25(6.81%) term=folks
id=30 ct=25(6.81%) term=plan
id=31 ct=24(6.54%) term=year
id=32 ct=24(6.54%) term=number
id=33 ct=24(6.54%) term=work
id=34 ct=23(6.27%) term=cant
id=35 ct=23(6.27%) term=american
id=36 ct=23(6.27%) term=done
id=37 ct=23(6.27%) term=small
id=38 ct=23(6.27%) term=place
id=39 ct=23(6.27%) term=part
id=40 ct=22(5.99%) term=over

And here is the same analysis for a GOP primary debate:

Total Count of most terms : 12322
Interesting Word Freq Count: 2436
id=1 ct=156(35.78%) term=king
id=2 ct=103(23.62%) term=people
id=3 ct=97(22.25%) term=right
id=4 ct=93(21.33%) term=president
id=5 ct=80(18.35%) term=question
id=6 ct=74(16.97%) term=states
id=7 ct=73(16.74%) term=government
id=8 ct=57(13.07%) term=think
id=9 ct=52(11.93%) term=need
id=10 ct=51(11.70%) term=john
id=11 ct=51(11.70%) term=governor
id=12 ct=46(10.55%) term=country
id=13 ct=46(10.55%) term=back
id=14 ct=44(10.09%) term=united
id=15 ct=43(9.86%) term=cain
id=16 ct=43(9.86%) term=take
id=17 ct=43(9.86%) term=romney
id=18 ct=42(9.63%) term=hampshire
id=19 ct=42(9.63%) term=paul
id=20 ct=42(9.63%) term=first
id=21 ct=41(9.40%) term=candidates
id=22 ct=40(9.17%) term=jobs
id=23 ct=39(8.94%) term=state
id=24 ct=39(8.94%) term=time
id=25 ct=39(8.94%) term=federal
id=26 ct=38(8.72%) term=pawlenty
id=27 ct=36(8.26%) term=down
id=28 ct=35(8.03%) term=american
id=29 ct=34(7.80%) term=believe
id=30 ct=34(7.80%) term=america
id=31 ct=34(7.80%) term=economy
id=32 ct=33(7.57%) term=years
id=33 ct=32(7.34%) term=obama
id=34 ct=32(7.34%) term=bachmann
id=35 ct=32(7.34%) term=applause
id=36 ct=32(7.34%) term=money
id=37 ct=31(7.11%) term=issue
id=38 ct=31(7.11%) term=thank
id=39 ct=30(6.88%) term=over
id=40 ct=30(6.88%) term=santorum
id=41 ct=30(6.88%) term=look
id=42 ct=29(6.65%) term=program
id=43 ct=28(6.42%) term=work
id=44 ct=26(5.96%) term=things
id=45 ct=26(5.96%) term=care
id=46 ct=25(5.73%) term=make
id=47 ct=25(5.73%) term=percent
id=48 ct=25(5.73%) term=doing
id=49 ct=24(5.50%) term=obamacare
id=50 ct=24(5.50%) term=where
id=51 ct=24(5.50%) term=administration
id=52 ct=24(5.50%) term=national
id=53 ct=24(5.50%) term=private
id=54 ct=24(5.50%) term=other
id=55 ct=23(5.28%) term=republican
id=56 ct=23(5.28%) term=break
id=57 ct=23(5.28%) term=congressman
id=58 ct=23(5.28%) term=tonight
id=59 ct=23(5.28%) term=senator
id=60 ct=23(5.28%) term=questions
id=61 ct=22(5.05%) term=gingrich
id=62 ct=22(5.05%) term=issues
id=63 ct=21(4.82%) term=medicare
id=64 ct=20(4.59%) term=problem
id=65 ct=20(4.59%) term=life
id=66 ct=20(4.59%) term=cant
id=67 ct=20(4.59%) term=wrong
id=68 ct=20(4.59%) term=continue
id=69 ct=20(4.59%) term=party
id=70 ct=20(4.59%) term=tell
id=71 ct=20(4.59%) term=done
id=72 ct=20(4.59%) term=give
id=73 ct=19(4.36%) term=answer
id=74 ct=19(4.36%) term=start
id=75 ct=19(4.36%) term=policy
id=76 ct=19(4.36%) term=congress
id=77 ct=19(4.36%) term=last
id=78 ct=19(4.36%) term=speaker
id=79 ct=18(4.13%) term=thing
id=80 ct=18(4.13%) term=plan
id=81 ct=18(4.13%) term=debate
id=82 ct=18(4.13%) term=point
id=83 ct=17(3.90%) term=shouldnt
id=84 ct=17(3.90%) term=world
id=85 ct=17(3.90%) term=could
id=86 ct=17(3.90%) term=bill
id=87 ct=17(3.90%) term=home
id=88 ct=17(3.90%) term=little
id=89 ct=16(3.67%) term=conversation
id=90 ct=16(3.67%) term=support
id=91 ct=16(3.67%) term=republicans
id=92 ct=16(3.67%) term=didnt
id=93 ct=16(3.67%) term=better
id=94 ct=16(3.67%) term=maybe
id=95 ct=16(3.67%) term=keep
id=96 ct=15(3.44%) term=made
id=97 ct=15(3.44%) term=year
id=98 ct=15(3.44%) term=again

Here is the same analysis run over several job resumes:

Total Count of most terms : 1967
Interesting Word Freq Count: 974
id=1 ct=38(50.67%) term=software
id=2 ct=21(28.00%) term=linux
id=3 ct=20(26.67%) term=developed
id=4 ct=20(26.67%) term=using
id=5 ct=19(25.33%) term=data
id=6 ct=16(21.33%) term=code
id=7 ct=14(18.67%) term=experience
id=8 ct=13(17.33%) term=engineer
id=9 ct=12(16.00%) term=image
id=10 ct=12(16.00%) term=computer
id=11 ct=11(14.67%) term=java
id=12 ct=10(13.33%) term=programming
id=13 ct=10(13.33%) term=design
id=14 ct=10(13.33%) term=windows
id=15 ct=10(13.33%) term=metrics
id=16 ct=10(13.33%) term=graphics
id=17 ct=9(12.00%) term=languages
id=18 ct=9(12.00%) term=realtime
id=19 ct=9(12.00%) term=over
id=20 ct=9(12.00%) term=maintained
id=21 ct=9(12.00%) term=development
id=22 ct=8(10.67%) term=developer
id=23 ct=8(10.67%) term=used
id=24 ct=8(10.67%) term=algorithms
id=25 ct=8(10.67%) term=machine
id=26 ct=7(9.33%) term=processing
id=27 ct=7(9.33%) term=python
id=28 ct=7(9.33%) term=team
id=29 ct=7(9.33%) term=worked
id=30 ct=7(9.33%) term=helped
id=31 ct=7(9.33%) term=years
id=32 ct=7(9.33%) term=university
id=33 ct=7(9.33%) term=game
id=34 ct=7(9.33%) term=perl
id=35 ct=7(9.33%) term=google
id=36 ct=6(8.00%) term=video
id=37 ct=6(8.00%) term=project
id=38 ct=6(8.00%) term=rendering
id=39 ct=6(8.00%) term=monica
id=40 ct=6(8.00%) term=learning
id=41 ct=6(8.00%) term=senior
id=42 ct=6(8.00%) term=product
id=43 ct=6(8.00%) term=technology
id=44 ct=6(8.00%) term=santa
id=45 ct=6(8.00%) term=application
id=46 ct=6(8.00%) term=engineering
id=47 ct=6(8.00%) term=server
id=48 ct=6(8.00%) term=skills
id=49 ct=6(8.00%) term=shiraz
id=50 ct=6(8.00%) term=research
id=51 ct=5(6.67%) term=advanced
id=52 ct=5(6.67%) term=animation
id=53 ct=5(6.67%) term=applications
id=54 ct=5(6.67%) term=designed
id=55 ct=5(6.67%) term=pipeline
id=56 ct=5(6.67%) term=towards
id=57 ct=5(6.67%) term=port
id=58 ct=5(6.67%) term=optimized
id=59 ct=5(6.67%) term=networking
id=60 ct=5(6.67%) term=audacity
id=61 ct=5(6.67%) term=microsoft
id=62 ct=5(6.67%) term=parallel
id=63 ct=5(6.67%) term=audio
id=64 ct=5(6.67%) term=network
id=65 ct=5(6.67%) term=javascript
id=66 ct=5(6.67%) term=aphrodite
id=67 ct=5(6.67%) term=wrote
id=68 ct=5(6.67%) term=implemented
id=69 ct=5(6.67%) term=technical
id=70 ct=5(6.67%) term=responsible
id=71 ct=5(6.67%) term=custom
id=72 ct=5(6.67%) term=systems
id=73 ct=5(6.67%) term=other
id=74 ct=5(6.67%) term=researched

And here are the results for a set of job descriptions:

Total Count of most terms : 918
Interesting Word Freq Count: 479
id=1 ct=23(92.00%) term=experience
id=2 ct=13(52.00%) term=development
id=3 ct=12(48.00%) term=software
id=4 ct=12(48.00%) term=systems
id=5 ct=10(40.00%) term=design
id=6 ct=9(36.00%) term=security
id=7 ct=8(32.00%) term=java
id=8 ct=8(32.00%) term=skills
id=9 ct=8(32.00%) term=plus
id=10 ct=7(28.00%) term=required
id=11 ct=7(28.00%) term=must
id=12 ct=6(24.00%) term=projects
id=13 ct=6(24.00%) term=computer
id=14 ct=6(24.00%) term=strong
id=15 ct=6(24.00%) term=network
id=16 ct=6(24.00%) term=work
id=17 ct=5(20.00%) term=netwitness
id=18 ct=5(20.00%) term=applications
id=19 ct=5(20.00%) term=team
id=20 ct=5(20.00%) term=requirements
id=21 ct=5(20.00%) term=spring
id=22 ct=5(20.00%) term=science
id=23 ct=5(20.00%) term=information
id=24 ct=5(20.00%) term=solutions

Tuesday, December 18, 2012