Basic word frequency analysis

Here are some interesting terms in the Democratic presidential debate from 2008:

I believe we're at a defining moment in our history. Our nation is at war; our planet is in peril....


-------------------------------

Total Count of most terms : 9125
Interesting Word Freq Count: 1952
-------------------------------
id=1 ct=112(39.16%) term=think
id=2 ct=101(35.31%) term=applause
id=3 ct=97(33.92%) term=clinton
id=4 ct=97(33.92%) term=people
id=5 ct=85(29.72%) term=senator
id=6 ct=66(23.08%) term=health
id=7 ct=62(21.68%) term=obama
id=8 ct=56(19.58%) term=care
id=9 ct=56(19.58%) term=blitzer
id=10 ct=47(16.43%) term=right
id=11 ct=44(15.38%) term=president
id=12 ct=40(13.99%) term=country
id=13 ct=35(12.24%) term=make
id=14 ct=34(11.89%) term=plan
id=15 ct=32(11.19%) term=question
id=16 ct=30(10.49%) term=believe
id=17 ct=30(10.49%) term=important
id=18 ct=28(9.79%) term=issue
id=19 ct=28(9.79%) term=take
id=20 ct=27(9.44%) term=time
id=21 ct=26(9.09%) term=years
id=22 ct=26(9.09%) term=american
id=23 ct=25(8.74%) term=first
id=24 ct=24(8.39%) term=insurance
id=25 ct=23(8.04%) term=bush
id=26 ct=23(8.04%) term=part
id=27 ct=21(7.34%) term=iraq
id=28 ct=20(6.99%) term=year
id=29 ct=20(6.99%) term=million
id=30 ct=19(6.64%) term=need
id=31 ct=19(6.64%) term=united
id=32 ct=19(6.64%) term=states
id=33 ct=18(6.29%) term=over
id=34 ct=18(6.29%) term=able
id=35 ct=17(5.94%) term=change
id=36 ct=17(5.94%) term=immigration
id=37 ct=17(5.94%) term=trying
id=38 ct=17(5.94%) term=work
id=39 ct=17(5.94%) term=clear
id=40 ct=17(5.94%) term=loo
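
A listing like the one above can be approximated with a few lines of Python. This is only a minimal sketch, not the program that produced these numbers: the names `interesting_terms`, `format_report`, and `STOP_WORDS` are hypothetical, and the stop-word list is a tiny placeholder. The four-letter minimum and the apostrophe stripping are guesses based on the output (no term is shorter than four letters, and forms such as "cant" show up in the listings). The percentage column is left out because the post does not say what it is measured against.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the post does not show the one it used.
STOP_WORDS = {"that", "this", "with", "have", "what",
              "will", "from", "they", "were", "been"}


def interesting_terms(text, min_len=4, stop_words=STOP_WORDS):
    """Count lowercase words of at least min_len letters that are not stop words."""
    # Drop ASCII apostrophes first so "can't" is counted as "cant".
    cleaned = text.lower().replace("'", "")
    words = re.findall(r"[a-z]+", cleaned)
    return Counter(w for w in words
                   if len(w) >= min_len and w not in stop_words)


def format_report(counts, top=40):
    """Return 'id=N ct=C term=T' lines for the most frequent terms."""
    lines = []
    # most_common sorts by count, highest first; ties keep first-seen order.
    for rank, (term, ct) in enumerate(counts.most_common(top), start=1):
        lines.append(f"id={rank} ct={ct} term={term}")
    return lines


if __name__ == "__main__":
    sample = ("Health care matters. We can't wait on health care; "
              "people need health care that works.")
    for line in format_report(interesting_terms(sample), top=5):
        print(line)
```

A real run would need a much larger stop-word list, and the choice of that list largely decides which terms count as "interesting".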

Contrast this word frequency data with an Obama and Romney debate in 2012:

-------------------------------
Total Count of most terms : 10361
Interesting Word Freq Count: 1853
-------------------------------
id=1 ct=148(40.33%) term=romney
id=2 ct=109(29.70%) term=people
id=3 ct=106(28.88%) term=governor
id=4 ct=102(27.79%) term=president
id=5 ct=101(27.52%) term=make
id=6 ct=89(24.25%) term=obama
id=7 ct=86(23.43%) term=crowley
id=8 ct=72(19.62%) term=jobs
id=9 ct=71(19.35%) term=question
id=10 ct=66(17.98%) term=years
id=11 ct=44(11.99%) term=four
id=12 ct=43(11.72%) term=think
id=13 ct=41(11.17%) term=percent
id=14 ct=40(10.90%) term=country
id=15 ct=40(10.90%) term=energy
id=16 ct=40(10.90%) term=last
id=17 ct=35(9.54%) term=economy
id=18 ct=34(9.26%) term=down
id=19 ct=31(8.45%) term=right
id=20 ct=31(8.45%) term=america
id=21 ct=30(8.17%) term=back
id=22 ct=28(7.63%) term=women
id=23 ct=27(7.36%) term=time
id=24 ct=26(7.08%) term=need
id=25 ct=26(7.08%) term=believe
id=26 ct=26(7.08%) term=able
id=27 ct=26(7.08%) term=good
id=28 ct=26(7.08%) term=million
id=29 ct=25(6.81%) term=folks
id=30 ct=25(6.81%) term=plan
id=31 ct=24(6.54%) term=year
id=32 ct=24(6.54%) term=number
id=33 ct=24(6.54%) term=work
id=34 ct=23(6.27%) term=cant
id=35 ct=23(6.27%) term=american
id=36 ct=23(6.27%) term=done
id=37 ct=23(6.27%) term=small
id=38 ct=23(6.27%) term=place
id=39 ct=23(6.27%) term=part
id=40 ct=22(5.99%) term=over
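
One simple way to contrast two listings is to look at which terms they share and which appear in only one of them. The sketch below assumes each listing has already been loaded into a dictionary of term to count. The function name `compare_terms` is hypothetical, and the sample dictionaries hold only a handful of rows copied from the 2008 and 2012 listings, not the full data.

```python
def compare_terms(counts_a, counts_b):
    """Split the terms of two frequency tables into shared and unique groups."""
    a, b = set(counts_a), set(counts_b)
    return {
        "shared": sorted(a & b),   # terms present in both tables
        "only_a": sorted(a - b),   # terms present only in the first table
        "only_b": sorted(b - a),   # terms present only in the second table
    }


if __name__ == "__main__":
    # A few rows from each listing, for illustration only.
    debate_2008 = {"think": 112, "people": 97, "health": 66, "care": 56, "iraq": 21}
    debate_2012 = {"people": 109, "jobs": 72, "think": 43, "energy": 40, "economy": 35}

    result = compare_terms(debate_2008, debate_2012)
    for group, terms in result.items():
        print(group, terms)
```

Because each listing is cut off at a fixed number of rows, a term landing in "only_a" or "only_b" means it missed the other listing's top rows, not that it was never spoken in the other debate.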

And here is the same data for a GOP debate:

-------------------------------
Total Count of most terms : 12322
Interesting Word Freq Count: 2436
-------------------------------
id=1 ct=156(35.78%) term=king
id=2 ct=103(23.62%) term=people
id=3 ct=97(22.25%) term=right
id=4 ct=93(21.33%) term=president
id=5 ct=80(18.35%) term=question
id=6 ct=74(16.97%) term=states
id=7 ct=73(16.74%) term=government
id=8 ct=57(13.07%) term=think
id=9 ct=52(11.93%) term=need
id=10 ct=51(11.70%) term=john
id=11 ct=51(11.70%) term=governor
id=12 ct=46(10.55%) term=country
id=13 ct=46(10.55%) term=back
id=14 ct=44(10.09%) term=united
id=15 ct=43(9.86%) term=cain
id=16 ct=43(9.86%) term=take
id=17 ct=43(9.86%) term=romney
id=18 ct=42(9.63%) term=hampshire
id=19 ct=42(9.63%) term=paul
id=20 ct=42(9.63%) term=first
id=21 ct=41(9.40%) term=candidates
id=22 ct=40(9.17%) term=jobs
id=23 ct=39(8.94%) term=state
id=24 ct=39(8.94%) term=time
id=25 ct=39(8.94%) term=federal
id=26 ct=38(8.72%) term=pawlenty
id=27 ct=36(8.26%) term=down
id=28 ct=35(8.03%) term=american
id=29 ct=34(7.80%) term=believe
id=30 ct=34(7.80%) term=america
id=31 ct=34(7.80%) term=economy
id=32 ct=33(7.57%) term=years
id=33 ct=32(7.34%) term=obama
id=34 ct=32(7.34%) term=bachmann
id=35 ct=32(7.34%) term=applause
id=36 ct=32(7.34%) term=money
id=37 ct=31(7.11%) term=issue
id=38 ct=31(7.11%) term=thank
id=39 ct=30(6.88%) term=over
id=40 ct=30(6.88%) term=santorum
id=41 ct=30(6.88%) term=look
id=42 ct=29(6.65%) term=program
id=43 ct=28(6.42%) term=work
id=44 ct=26(5.96%) term=things
id=45 ct=26(5.96%) term=care
id=46 ct=25(5.73%) term=make
id=47 ct=25(5.73%) term=percent
id=48 ct=25(5.73%) term=doing
id=49 ct=24(5.50%) term=obamacare
id=50 ct=24(5.50%) term=where
id=51 ct=24(5.50%) term=administration
id=52 ct=24(5.50%) term=national
id=53 ct=24(5.50%) term=private
id=54 ct=24(5.50%) term=other
id=55 ct=23(5.28%) term=republican
id=56 ct=23(5.28%) term=break
id=57 ct=23(5.28%) term=congressman
id=58 ct=23(5.28%) term=tonight
id=59 ct=23(5.28%) term=senator
id=60 ct=23(5.28%) term=questions
id=61 ct=22(5.05%) term=gingrich
id=62 ct=22(5.05%) term=issues
id=63 ct=21(4.82%) term=medicare
id=64 ct=20(4.59%) term=problem
id=65 ct=20(4.59%) term=life
id=66 ct=20(4.59%) term=cant
id=67 ct=20(4.59%) term=wrong
id=68 ct=20(4.59%) term=continue
id=69 ct=20(4.59%) term=party
id=70 ct=20(4.59%) term=tell
id=71 ct=20(4.59%) term=done
id=72 ct=20(4.59%) term=give
id=73 ct=19(4.36%) term=answer
id=74 ct=19(4.36%) term=start
id=75 ct=19(4.36%) term=policy
id=76 ct=19(4.36%) term=congress
id=77 ct=19(4.36%) term=last
id=78 ct=19(4.36%) term=speaker
id=79 ct=18(4.13%) term=thing
id=80 ct=18(4.13%) term=plan
id=81 ct=18(4.13%) term=debate
id=82 ct=18(4.13%) term=point
id=83 ct=17(3.90%) term=shouldnt
id=84 ct=17(3.90%) term=world
id=85 ct=17(3.90%) term=could
id=86 ct=17(3.90%) term=bill
id=87 ct=17(3.90%) term=home
id=88 ct=17(3.90%) term=little
id=89 ct=16(3.67%) term=conversation
id=90 ct=16(3.67%) term=support
id=91 ct=16(3.67%) term=republicans
id=92 ct=16(3.67%) term=didnt
id=93 ct=16(3.67%) term=better
id=94 ct=16(3.67%) term=maybe
id=95 ct=16(3.67%) term=keep
id=96 ct=15(3.44%) term=made
id=97 ct=15(3.44%) term=year
id=98 ct=15(3.44%) term=again

Here are the word frequencies for several job resumes:

-------------------------------
Total Count of most terms : 1967
Interesting Word Freq Count: 974
-------------------------------
id=1 ct=38(50.67%) term=software
id=2 ct=21(28.00%) term=linux
id=3 ct=20(26.67%) term=developed
id=4 ct=20(26.67%) term=using
id=5 ct=19(25.33%) term=data
id=6 ct=16(21.33%) term=code
id=7 ct=14(18.67%) term=experience
id=8 ct=13(17.33%) term=engineer
id=9 ct=12(16.00%) term=image
id=10 ct=12(16.00%) term=computer
id=11 ct=11(14.67%) term=java
id=12 ct=10(13.33%) term=programming
id=13 ct=10(13.33%) term=design
id=14 ct=10(13.33%) term=windows
id=15 ct=10(13.33%) term=metrics
id=16 ct=10(13.33%) term=graphics
id=17 ct=9(12.00%) term=languages
id=18 ct=9(12.00%) term=realtime
id=19 ct=9(12.00%) term=over
id=20 ct=9(12.00%) term=maintained
id=21 ct=9(12.00%) term=development
id=22 ct=8(10.67%) term=developer
id=23 ct=8(10.67%) term=used
id=24 ct=8(10.67%) term=algorithms
id=25 ct=8(10.67%) term=machine
id=26 ct=7(9.33%) term=processing
id=27 ct=7(9.33%) term=python
id=28 ct=7(9.33%) term=team
id=29 ct=7(9.33%) term=worked
id=30 ct=7(9.33%) term=helped
id=31 ct=7(9.33%) term=years
id=32 ct=7(9.33%) term=university
id=33 ct=7(9.33%) term=game
id=34 ct=7(9.33%) term=perl
id=35 ct=7(9.33%) term=google
id=36 ct=6(8.00%) term=video
id=37 ct=6(8.00%) term=project
id=38 ct=6(8.00%) term=rendering
id=39 ct=6(8.00%) term=monica
id=40 ct=6(8.00%) term=learning
id=41 ct=6(8.00%) term=senior
id=42 ct=6(8.00%) term=product
id=43 ct=6(8.00%) term=technology
id=44 ct=6(8.00%) term=santa
id=45 ct=6(8.00%) term=application
id=46 ct=6(8.00%) term=engineering
id=47 ct=6(8.00%) term=server
id=48 ct=6(8.00%) term=skills
id=49 ct=6(8.00%) term=shiraz
id=50 ct=6(8.00%) term=research
id=51 ct=5(6.67%) term=advanced
id=52 ct=5(6.67%) term=animation
id=53 ct=5(6.67%) term=applications
id=54 ct=5(6.67%) term=designed
id=55 ct=5(6.67%) term=pipeline
id=56 ct=5(6.67%) term=towards
id=57 ct=5(6.67%) term=port
id=58 ct=5(6.67%) term=optimized
id=59 ct=5(6.67%) term=networking
id=60 ct=5(6.67%) term=audacity
id=61 ct=5(6.67%) term=microsoft
id=62 ct=5(6.67%) term=parallel
id=63 ct=5(6.67%) term=audio
id=64 ct=5(6.67%) term=network
id=65 ct=5(6.67%) term=javascript
id=66 ct=5(6.67%) term=aphrodite
id=67 ct=5(6.67%) term=wrote
id=68 ct=5(6.67%) term=implemented
id=69 ct=5(6.67%) term=technical
id=70 ct=5(6.67%) term=responsible
id=71 ct=5(6.67%) term=custom
id=72 ct=5(6.67%) term=systems
id=73 ct=5(6.67%) term=other
id=74 ct=5(6.67%) term=researched

Here is the same analysis applied to job descriptions:

-------------------------------
Total Count of most terms : 918
Interesting Word Freq Count: 479
-------------------------------
id=1 ct=23(92.00%) term=experience
id=2 ct=13(52.00%) term=development
id=3 ct=12(48.00%) term=software
id=4 ct=12(48.00%) term=systems
id=5 ct=10(40.00%) term=design
id=6 ct=9(36.00%) term=security
id=7 ct=8(32.00%) term=java
id=8 ct=8(32.00%) term=skills
id=9 ct=8(32.00%) term=plus
id=10 ct=7(28.00%) term=required
id=11 ct=7(28.00%) term=must
id=12 ct=6(24.00%) term=projects
id=13 ct=6(24.00%) term=computer
id=14 ct=6(24.00%) term=strong
id=15 ct=6(24.00%) term=network
id=16 ct=6(24.00%) term=work
id=17 ct=5(20.00%) term=netwitness
id=18 ct=5(20.00%) term=applications
id=19 ct=5(20.00%) term=team
id=20 ct=5(20.00%) term=requirements
id=21 ct=5(20.00%) term=spring
id=22 ct=5(20.00%) term=science
id=23 ct=5(20.00%) term=information
id=24 ct=5(20.00%) term=solutions
