The Boing Boing Filter

Boingboing.net, the Internet's premier Directory of Wonderful Things, is a popular zine-cum-blog with a solid number of contributors. Each writer is going to have certain proclivities, interests, and areas of expertise. Unfortunately, they never bothered to make them explicit, like they do at the old dinosaur newspapers. So I took it upon myself to decide who's been writing about what! To do this, I've harnessed a machine learning approach similar to that of a simple spam filter. Except instead of doing any actual filtering with it, I popped open the hood and put the juiciest statistical associations up on this website.

Any particular word has some probability of appearing in any given post. To estimate it, take the number of posts in which you saw the word and divide it by the total number of posts you saw. This probability can be estimated for each word on a small scale, considering only the portfolio of posts from any one writer, as well as for the overall blog. Consider a word which appears in a "Blogger X" post with probability p and in a non-Blogger-X post with probability q. A high ratio p/q indicates a word Blogger X comparatively favors: Blogger X is more likely than the average Boing Boing'er to use that word. A low ratio indicates a word the writer comparatively avoids. The actual numbers you see are more or less just the base-10 log of these ratios -- a red bar of size 2 means that blogger is 10^2 = 100 times more likely to use the word than the rest of the family.
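For the curious, here is a minimal sketch of that calculation in Python. It is not the actual script behind this page: the function name `word_log_ratios`, the whitespace tokenizing, and the add-one smoothing (used so that neither p nor q is ever zero) are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_log_ratios(posts, author):
    """Return {word: log10(p / q)} for one author.

    posts is a list of (author_name, post_text) pairs. p is the estimated
    probability that a word shows up in one of the author's posts; q is the
    same estimate over everyone else's posts.
    """
    # Reduce each post to the set of distinct words it contains.
    mine = [set(text.lower().split()) for name, text in posts if name == author]
    rest = [set(text.lower().split()) for name, text in posts if name != author]

    # Count how many posts contain each word.
    mine_counts = Counter(word for words in mine for word in words)
    rest_counts = Counter(word for words in rest for word in words)

    ratios = {}
    for word in set(mine_counts) | set(rest_counts):
        # Add-one smoothing is an assumption here, not something the page
        # describes; it keeps the ratio finite for words only one side uses.
        p = (mine_counts[word] + 1) / (len(mine) + 2)
        q = (rest_counts[word] + 1) / (len(rest) + 2)
        ratios[word] = math.log10(p / q)
    return ratios
```

A word both sides use equally often scores 0, a word the author uses ten times as often as everyone else scores 1, and so on up the log scale.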

While the script crunches these ratio numbers for every word it's ever seen, it's only showing you folks the 10 most positive and the 10 most negative ratio-ed words. Most results are nifty -- I'd known going in that Cory Doctorow is your go-to blogger for fighting the copyfight, but who knew David Pescovitz was so big into science? We all do, now! Other results might be less nifty, but the good news is that as more and more posts accumulate in the sample, the estimates will grow more reliable. You can find an archive of past pages here, and you can compare these current results to plain-vanilla word counts by checking here.
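Picking out the words on display could look something like the sketch below. The helper name `extremes` is hypothetical, and keeping only strictly positive scores in the "favor" list and strictly negative scores in the "avoid" list is a guess at why some of the lists further down come up short.

```python
def extremes(ratios, k=10):
    """Return (favored, avoided): up to k (word, score) pairs each."""
    # Sort ascending by score, so the most-avoided words come first.
    ranked = sorted(ratios.items(), key=lambda item: item[1])
    # Most negative first; drop anything that is not actually negative.
    avoided = [(word, score) for word, score in ranked[:k] if score < 0]
    # Most positive first; drop anything that is not actually positive.
    favored = [(word, score) for word, score in reversed(ranked[-k:]) if score > 0]
    return favored, avoided


# Example with a handful of scores taken from Cory Doctorow's lists.
scores = {"dmca": 1.43, "cc": 1.3, "guest": -1.58, "snip": -1.47}
favored, avoided = extremes(scores, k=2)
```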

Datafiends, keep your eyes on this space: coming soon will be an archive of past results with each new update, and a gallery of the posts that (according to the filters, at least) best exemplify each writer's style.

-- Brian "bgawalt at gmail dot com" Gawalt


These results are based on 3866 total posts scraped since October 1, 2008.
The script has identified 31391 unique words.
All data current as of Tue Apr 7 09:30:35 2009 (PST).

Cory Doctorow tends to favor the words:

dmca
 1.43
parliament
 1.39
graveyard
 1.35
cc
 1.3
superb
 1.3
audiobook
 1.3
broadcasting
 1.3
singularity
 1.3
recommendations
 1.3
guild
 1.3

...but Cory Doctorow tends to avoid the words:

guest
 -1.58
titled
 -1.49
snip
 -1.47
brandon
 -1.41
subculture
 -1.33
choo
 -1.32
updates
 -1.29
animal
 -1.24
platt
 -1.23
ep
 -1.22

Calculated from the 1208 posts recorded for Cory Doctorow.
Cory Doctorow is the only contributor to use the word 'glyn,' but has never once used the word 'metzger.'

Mark Frauenfelder tends to favor the words:

jenny
 1.6
acorn
 1.41
beard
 1.41
ukulele
 1.37
hart
 1.34
programmed
 1.26
danielle
 1.17
ingredients
 1.16
paddy
 1.16
daughters
 1.16

...but Mark Frauenfelder tends to avoid the words:

copyright
 -1.39
commons
 -1.2
joel
 -1.18
choo
 -1.1
canadian
 -1.08
xeni
 -1.08
comment
 -1.06
adult
 -1.05
sites
 -1.05
blogging
 -1.04

Calculated from the 839 posts recorded for Mark Frauenfelder.
Mark Frauenfelder is the only contributor to use the word 'belle,' but has never once used the word 'guestblogger.'

David Pescovitz tends to favor the words:

jess
 1.77
surrealist
 1.56
roq
 1.43
gil
 1.36
zoo
 1.35
loren
 1.26
ryden
 1.26
hopkins
 1.25
electronica
 1.25
pumpkin
 1.25

...but David Pescovitz tends to avoid the words:

copyright
 -1.3
send
 -1.26
legal
 -1.22
itunes
 -1.21
discuss
 -1.16
embed
 -1.15
guest
 -1.14
blogs
 -1.11
policy
 -1.11
totally
 -1.11

Calculated from the 707 posts recorded for David Pescovitz.
David Pescovitz is the only contributor to use the word 'kirsten,' but has never once used the word 'subscribe.'

Xeni Jardin tends to favor the words:

snip
 2.65
susannah
 1.87
benin
 1.84
stills
 1.79
songhai
 1.75
bonner
 1.75
israeli
 1.75
intel
 1.75
johannes
 1.7
recap
 1.7

...but Xeni Jardin tends to avoid the words:

researchers
 -0.98
certain
 -0.94
metzger
 -0.94
main
 -0.93
subculture
 -0.88
easily
 -0.86
useful
 -0.83
drawing
 -0.82
stewart
 -0.81
expensive
 -0.8

Calculated from the 543 posts recorded for Xeni Jardin.
Xeni Jardin is the only contributor to use the word 'permalink,' but has never once used the word 'scientific.'

Richard Metzger tends to favor the words:

metzger
 3.33
snob
 2.26
jeannie
 2.08
prescient
 2.08
daunting
 2.08
mcginley
 2.08
tragically
 2.08
crazed
 2.08
mahalo
 2.08
socialism
 2.08

...but Richard Metzger tends to avoid the words:

number
 -0.5
information
 -0.49
human
 -0.49
link
 -0.48
inside
 -0.44
buy
 -0.43
obama
 -0.42
latest
 -0.41
web
 -0.4
couple
 -0.39

Calculated from the 65 posts recorded for Richard Metzger.
Richard Metzger is the only contributor to use the word 'diminished,' but has never once used the word 'under.'

Brandon Boyer tends to favor the words:

littlebigplanet
 2.93
procedurally
 2.48
robertson
 2.48
xbox
 2.45
ds
 2.41
noby
 2.4
game
 2.35
introversion
 2.34
earthbound
 2.34
mii
 2.34

...but Brandon Boyer tends to avoid the words:

use
 -0.77
said
 -0.71
great
 -0.71
using
 -0.59
people
 -0.59
post
 -0.58
old
 -0.54
blog
 -0.54
lot
 -0.53
john
 -0.52

Calculated from the 55 posts recorded for Brandon Boyer.
Brandon Boyer is the only contributor to use the word 'ragdoll,' but has never once used the word 'boing.'

Rob Beschizza tends to favor the words:

mat
 2.66
infomercia
 2.55
pegasus
 2.42
eee
 2.42
pippin
 2.36
dslr
 2.23
motherboard
 2.23
donned
 2.23
unsparing
 2.23
permaguest
 2.23

...but Rob Beschizza tends to avoid the words:

just
 -0.76
people
 -0.7
way
 -0.7
really
 -0.67
think
 -0.64
great
 -0.64
called
 -0.62
used
 -0.6
making
 -0.5
kind
 -0.47

Calculated from the 47 posts recorded for Rob Beschizza.
Rob Beschizza is the only contributor to use the word 'beheld,' but has never once used the word 'me.'

Dale Dougherty tends to favor the words:

understands
 2.48
prepares
 2.29
cite
 2.29
parkway
 2.29
cage
 2.29
varieties
 2.29
jets
 2.29
pomegranate
 1.99
calgary
 1.99
processor
 1.99

...but Dale Dougherty tends to avoid the words:

post
 -0.45
free
 -0.45
art
 -0.43
guest
 -0.41
posted
 -0.32
world
 -0.32
internet
 -0.32
photos
 -0.32
number
 -0.29
project
 -0.28

Calculated from the 41 posts recorded for Dale Dougherty.
Dale Dougherty is the only contributor to use the word 'banff,' but has never once used the word 'boing.'

Danny Choo tends to favor the words:

resides
 4.88
choo
 4.41
danny
 4.04
subculture
 3.97
blogs
 3.93
guestblogger
 3.73
works
 3.13
life
 2.77
plucked
 2.62
usd
 2.59

...but Danny Choo tends to avoid the words:

book
 -0.6
did
 -0.5
making
 -0.44
long
 -0.44
public
 -0.41
makes
 -0.39
john
 -0.39
real
 -0.36
science
 -0.35
help
 -0.33

Calculated from the 41 posts recorded for Danny Choo.
Danny Choo is the only contributor to use the word 'meiji,' but has never once used the word 'through.'

Susannah Breslin tends to favor the words:

orgy
 2.33
fabrics
 2.33
butterflies
 2.03
nitrate
 2.01
sheridan
 2.01
mortgaging
 2.01
chalayan
 2.01
flocks
 2.01
spatter
 2.01
philippe
 2.01

...but Susannah Breslin tends to avoid the words:

know
 -0.58
years
 -0.57
book
 -0.56
got
 -0.54
best
 -0.52
read
 -0.44
using
 -0.43
post
 -0.42
people
 -0.42
free
 -0.41

Calculated from the 38 posts recorded for Susannah Breslin.
Susannah Breslin is the only contributor to use the word 'showstudio,' but has never once used the word 'really.'

Charles Platt tends to favor the words:

platt
 3.14
automotive
 2.33
cooling
 2.33
populations
 2.33
vegetation
 2.33
emissions
 2.33
ipcc
 2.33
pickup
 2.21
mortality
 2.04
motorola
 2.04

...but Charles Platt tends to avoid the words:

work
 -0.62
said
 -0.55
think
 -0.54
great
 -0.54
got
 -0.54
best
 -0.52
come
 -0.43
post
 -0.42
making
 -0.4
blog
 -0.37

Calculated from the 38 posts recorded for Charles Platt.
Charles Platt is the only contributor to use the word 'aomori,' but has never once used the word 'boing.'

Dan Gillmor tends to favor the words:

gillmor
 2.72
outrage
 2.71
supremely
 2.57
disdain
 2.38
vp
 2.38
geithner
 2.38
bonuses
 2.23
boingboing
 2.1
commentators
 2.08
greenwald
 2.08

...but Dan Gillmor tends to avoid the words:

video
 -0.47
life
 -0.42
free
 -0.36
guest
 -0.33
including
 -0.31
series
 -0.28
year
 -0.28
online
 -0.27
does
 -0.25
like
 -0.25

Calculated from the 34 posts recorded for Dan Gillmor.
Dan Gillmor is the only contributor to use the word 'batteau,' but has never once used the word 'boing.'

ShawnBruce tends to favor the words:

astronomy
 2.41
crafter
 2.41
hills
 2.12
dub
 2.11
crops
 2.09
augustus
 2.09
saturn
 2.09
captcha
 2.09
snide
 2.09
treehugger
 2.09

...but ShawnBruce tends to avoid the words:

video
 -0.76
use
 -0.53
said
 -0.47
called
 -0.45
boing
 -0.41
come
 -0.35
making
 -0.33
including
 -0.28
came
 -0.25
online
 -0.24

Calculated from the 32 posts recorded for ShawnBruce.
ShawnBruce is the only contributor to use the word 'instilled,' but has never once used the word 'than.'

John Brownlee tends to favor the words:

brownlee
 2.69
marveled
 2.66
casio
 2.66
applauded
 2.62
beschizza
 2.58
admired
 2.5
waterproof
 2.47
fujitsu
 2.47
swank
 2.47
aloft
 2.47

...but John Brownlee tends to avoid the words:

work
 -0.48
people
 -0.46
world
 -0.46
know
 -0.44
book
 -0.42
think
 -0.4
just
 -0.32
things
 -0.29
using
 -0.29
music
 -0.27

Calculated from the 28 posts recorded for John Brownlee.
John Brownlee is the only contributor to use the word 'lusted,' but has never once used the word 'me.'

Gareth Branwyn tends to favor the words:

branwyn
 4.5
fringe
 4.2
contributing
 3.98
notebook
 3.9
instructables
 3.83
gareth
 3.83
editing
 3.75
ed
 3.46
writes
 3.31
maker
 3.3

...but Gareth Branwyn tends to avoid the words:

come
 -0.26
free
 -0.24
kind
 -0.2
today
 -0.2
video
 -0.15
science
 -0.14
american
 -0.13
home
 -0.12
live
 -0.11
game
 -0.09

Calculated from the 26 posts recorded for Gareth Branwyn.
Gareth Branwyn is the only contributor to use the word 'cajole,' but has never once used the word 'old.'

Susie Bright tends to favor the words:

bright
 3.36
handles
 2.31
sandra
 2.31
jeans
 2.31
susie
 2.31
stove
 2.31
ethos
 2.28
snagged
 2.28
consenting
 2.28
mobility
 2.28

...but Susie Bright tends to avoid the words:

use
 -0.34
video
 -0.24
used
 -0.23
come
 -0.16
using
 -0.16
post
 -0.15
long
 -0.13
set
 -0.13
public
 -0.11
old
 -0.1

Calculated from the 21 posts recorded for Susie Bright.
Susie Bright is the only contributor to use the word 'pilgrimages,' but has never once used the word 'boing.'

Joel Johnson tends to favor the words:

bop
 2.71
lingerie
 2.41
postponed
 2.38
rightfully
 2.38
stoked
 2.38
fingertips
 2.38
molecules
 2.38
metaplace
 2.38
truckers
 2.38
adept
 2.38

...but Joel Johnson tends to avoid the words:

people
 -0.22
really
 -0.21
think
 -0.18
good
 -0.17
video
 -0.14
day
 -0.06
post
 -0.05
long
 -0.04
love
 -0.03
art
 -0.03

Calculated from the 17 posts recorded for Joel Johnson.
Joel Johnson is the only contributor to use the word 'rubin,' but has never once used the word 'were.'

pspinrad tends to favor the words:

bodily
 3.46
vj
 2.98
paul
 2.89
christianity
 2.81
cynical
 2.51
consciously
 2.51
compression
 2.51
resistant
 2.47
lineage
 2.47
impressions
 2.47

...but pspinrad tends to avoid the words:

boing
 -0.68
best
 -0.06

Calculated from the 14 posts recorded for pspinrad.
pspinrad is the only contributor to use the word 'spinrad,' but has never once used the word 'little.'

rushkoff tends to favor the words:

rushkoff
 3.24
douglas
 2.9
resurrected
 2.85
biased
 2.54
offshoot
 2.51
outsider
 2.51
lineage
 2.51
marley
 2.51
danceable
 2.51
stoned
 2.51

...but rushkoff tends to avoid the words:

video
 -0.34
good
 -0.05

Calculated from the 13 posts recorded for rushkoff.
rushkoff is the only contributor to use the word 'centralized,' but has never once used the word 'boing.'

cshirky tends to favor the words:

overlap
 3.49
shirky
 3.09
telecommunications
 2.94
organizing
 2.94
teaches
 2.83
clay
 2.67
template
 2.54
amateurs
 2.54
interactive
 2.51
videomaker
 2.51

...but cshirky tends to avoid the words:

little
 -0.11
used
 -0.01

Calculated from the 13 posts recorded for cshirky.
cshirky is the only contributor to use the word 'itp,' but has never once used the word 'own.'

Steven Johnson tends to favor the words:

hyperlocal
 4.63
revolution
 3.39
birth
 3.39
invention
 3.35
steven
 3.26
johnson
 3.19
ed
 2.95
elemental
 2.89
aviation
 2.81
community
 2.63

...but Steven Johnson tends to avoid the words:

world
 -0.07
really
 -0.04

Calculated from the 12 posts recorded for Steven Johnson.
Steven Johnson is the only contributor to use the word 'rooting,' but has never once used the word 'boing.'

John Hodgman tends to favor the words:

eighties
 2.68
sublimely
 2.63
illustrating
 2.63
hoboes
 2.63
rien
 2.63
stressed
 2.63
inhuman
 2.63
fraudulent
 2.63
ups
 2.63
derives
 2.63

...but John Hodgman tends to avoid the words:

new
 -0.58
boing
 -0.17
just
 -0.0

Calculated from the 10 posts recorded for John Hodgman.
John Hodgman is the only contributor to use the word 'froud,' but has never once used the word 'us.'

Derek Bledsoe tends to favor the words:

ep
 3.09
updates
 3.01
claypool
 2.98
icon
 2.94
provider
 2.91
archives
 2.82
embed
 2.8
twitter
 2.79
subscribe
 2.74
itunes
 2.73

...but Derek Bledsoe tends to avoid the words:

people
 -0.3
make
 -0.21

Calculated from the 10 posts recorded for Derek Bledsoe.
Derek Bledsoe is the only contributor to use the word 'metallica,' but has never once used the word 'i.'

dangillmor tends to favor the words:

juxtaposition
 2.98
apathetic
 2.98
mockery
 2.98
quaint
 2.98
stifle
 2.98
indulging
 2.98
indignation
 2.98
collude
 2.98
penned
 2.98
graphical
 2.68

...but dangillmor tends to avoid the words:

Calculated from the 5 posts recorded for dangillmor.
dangillmor is the only contributor to use the word 'carr,' but has never once used the word 'boing.'

Teresa Nielsen Hayden Moderator tends to favor the words:

zune
 3.29
crossword
 3.29
fett
 3.11
whitney
 3.11
donor
 3.11
misused
 3.11
likelier
 3.11
iambic
 3.11
copyfighter
 3.11
jargon
 3.11

...but Teresa Nielsen Hayden Moderator tends to avoid the words:

Calculated from the 4 posts recorded for Teresa Nielsen Hayden Moderator.
Teresa Nielsen Hayden Moderator is the only contributor to use the word 'moderator,' but has never once used the word 'new.'

Teresa Nielsen Hayden Community Manager tends to favor the words:

cleverness
 3.11
stalked
 3.11
genteel
 3.11
flashed
 3.11
undying
 3.11
rabbi
 3.11
retarded
 3.11
ambivalence
 3.11
tooled
 3.11
tres
 3.11

...but Teresa Nielsen Hayden Community Manager tends to avoid the words:

Calculated from the 4 posts recorded for Teresa Nielsen Hayden Community Manager.
Teresa Nielsen Hayden Community Manager is the only contributor to use the word 'shapeways,' but has never once used the word 'who.'



Lots of thanks to Jonathan Soma for all his help on the web design.

