Text Analysis using Concordance
This post was automatically copied from Text Analysis using Concordance on eklausmeier.goip.de.
When analyzing a longer text, especially one you wrote yourself, it helps to read the text in a different way, here using a concordance.
Assume your text is provided as PDF. Convert the PDF to text using pdftotext, which is part of the poppler package. Replace line breaks in the text file with spaces using the C program below (called linebreak.c):
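#include <stdio.h>

int main(int argc, char *argv[]) {
    int c, flag = 0;    /* flag counts consecutive newlines */
    FILE *fp;

    /* Read from the file given as first argument, else from stdin */
    if (argc >= 2) {
        if ((fp = fopen(argv[1],"rb")) == NULL)
            return 1;
    } else {
        fp = stdin;
    }

    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            flag += 1;
            if (flag > 1) { putchar(c); flag = 0; }   /* second newline in a row: keep it */
            else putchar(' ');                        /* single newline: becomes a space */
        } else {
            flag = 0;
            putchar(c);
        }
    }

    return 0;
}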
Then generate a list of (single) words with below Perl program:
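#!/bin/perl -W
# Print word concordances: count how often each word occurs

use strict;
my (%H,@F);

# Read all input lines (files given as arguments, else stdin)
while (<>) {
    chomp;
    s/\s+$//;   # rtrim
    @F = split;
    foreach my $w (@F) {
        $w =~ s/^\s+//; # ltrim
        $w =~ s/\s+$//; # rtrim
        $H{$w} += 1;
    }
}

# Print count and word, sorted by word
foreach my $w (sort keys %H) {
    printf("\t%6d\t%s\n",$H{$w},$w);
}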
To print all word pairs, replace the while loop above with
while (<>) {
    chomp;
    s/\s+$//;   # rtrim
    @F = split;
    # $#F is the last index, so $i+1 never runs past the end
    for(my $i=0; $i<$#F; ++$i) {
        $F[$i] =~ s/^\s+//;     # ltrim
        $F[$i] =~ s/\s+$//;     # rtrim
        $F[$i+1] =~ s/^\s+//;   # ltrim
        $F[$i+1] =~ s/\s+$//;   # rtrim
        $H{$F[$i] . " " . $F[$i+1]} += 1;
    }
}
Similarly, for word triples, replace the while loop with
while (<>) {
    chomp;
    s/\s+$//;   # rtrim
    @F = split;
    # $i+2 must not exceed the last index $#F
    for(my $i=0; $i+1<$#F; ++$i) {
        $F[$i] =~ s/^\s+//;     # ltrim
        $F[$i] =~ s/\s+$//;     # rtrim
        $F[$i+1] =~ s/^\s+//;   # ltrim
        $F[$i+1] =~ s/\s+$//;   # rtrim
        $F[$i+2] =~ s/^\s+//;   # ltrim
        $F[$i+2] =~ s/\s+$//;   # rtrim
        $H{$F[$i] . " " . $F[$i+1] . " " . $F[$i+2]} += 1;
    }
}
Printing concordances using Perl hashes is very simple, as one can see.
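Here is an example from the man page of expect, using the command sequence below, where linebreak is the compiled C program and word3concord is the word-triple variant of the Perl script: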
( TERM=dumb; man expect ) | linebreak | word3concord | sort -r
The truncated result is:
16 For example, the
13 example, the following
12 the current process.
9 the end of
8 using Expectk, this
8 this option is
8 sent to the
8 flag causes the
8 body is executed
8 Expectk, this option
8 (When using Expectk,
7 to the current
7 the spawn id
7 the most recent
7 the current process
7 the corresponding body
7 option is specified
7 is specified as
7 corresponding body is
7 by Don Libes,
7 be used to
6 set for the
6 of the current
6 is set for
6 is an alias