童言无忌-Tabooless Babble: Perl Scripts for Simplifying the Layout of Archived Twitter Message Pages

Tuesday, June 12, 2012

Perl Scripts for Simplifying the Layout of Archived Twitter Message Pages

# Clean up saved mobile.twitter.com message pages.
# 1. Save the Twitter message web page as complete HTML, e.g. 01.htm
# NOTE: The Chrome browser can't save the full Twitter messages in "HTML Only" mode.
#
# 2. Install Perl, e.g. ActivePerl http://www.activestate.com/activeperl
# 3. Save this script as mobile.twitter_msg_cleanup.pl, in the same folder as the file from step 1.
# 4. Open a command window and cd to the folder where the files are stored.
# 5. Enter the command: mobile.twitter_msg_cleanup.pl -h 01.htm > 01o.htm
# 6. Open 01o.htm in a browser to check the result.

use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

my $html_mode = 0;

if (@ARGV == 0) {
    print "Usage: $0 [-h] <input html file> > <output file>\n-h: Output HTML format, otherwise text format.\n";
    exit;
}

if ($ARGV[0] eq "-h") {
    $html_mode = 1;
    shift;
}

open(my $fh, "<:utf8", (shift || "index.htm")) || die "Can't open file: $!";
# The input is decoded as UTF-8, so encode the output the same way.
binmode(STDOUT, ":utf8");
my $p = HTML::TokeParser->new($fh);

my $head = <<EOF;
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
EOF

print $head if $html_mode;

# Each tweet starts at <td class="user-info">; skip any other <td>.
while (my $token = $p->get_tag("td")) {
    my $class = $token->[1]{class} // "";
    next if $class ne "user-info";

    # Display name.
    $token = $p->get_tag("strong");
    $class = $token->[1]{class} // "";
    next if $class ne "fullname";
    my $fullname = $p->get_text("/strong");

    # User name: skip past the first </span>, then read up to the next one.
    $token = $p->get_tag("span");
    $class = $token->[1]{class} // "";
    next if $class ne "username";
    $p->get_tag("/span");
    my $username = $p->get_trimmed_text("/span");

    # Timestamp cell: the link points at the tweet, its text is the time.
    $token = $p->get_tag("td");
    $class = $token->[1]{class} // "";
    next if $class ne "timestamp";
    $token = $p->get_tag("a");
    my $url = $token->[1]{href} // "";
    my $time = $p->get_text("/a");
    if ($html_mode) {
        print "<a href=\"$url\">$fullname $username $time</a><br />\n";
    } else {
        print "$fullname $username $time $url\n";
    }

    # Tweet body: copy text and external links until the closing </div>.
    $token = $p->get_tag("div");
    $class = $token->[1]{class} // "";
    next if $class ne "tweet-text";
    while ($token = $p->get_token) {
        if ($token->[0] eq "E" && $token->[1] eq "div") {
            print $html_mode ? "<br />\n<br />\n" : "\n\n";
            last;
        }
        if ($token->[0] eq "T") {
            my $text = $token->[1];
            # Text tokens are raw HTML: keep the entities in HTML mode so that
            # characters such as "<" stay escaped, and decode them for text output.
            decode_entities($text) unless $html_mode;
            print $text;
        }
        if ($token->[0] eq "S" && $token->[1] eq "a"
            && ($token->[2]{class} // "") eq "twitter_external_link") {
            my $link = $token->[2]{"href"} // "";
            my $link_text = $p->get_text("/a");
            if ($html_mode) {
                print "<a href=\"$link\">$link_text</a>";
            } else {
                print " $link = $link_text ";
            }
        }
    }
}

print "</body>\n</html>\n" if $html_mode;

# Clean up saved twitter.com message pages.
# 1. Save the Twitter message web page as complete HTML, e.g. 01.htm
# NOTE: The Chrome browser can't save the full Twitter messages in "HTML Only" mode.
#
# 2. Install Perl, e.g. ActivePerl http://www.activestate.com/activeperl
# 3. Save this script as twitter_msg_cleanup.pl, in the same folder as the file from step 1.
# 4. Open a command window and cd to the folder where the files are stored.
# 5. Enter the command: twitter_msg_cleanup.pl -h 01.htm > 01o.htm
# 6. Open 01o.htm in a browser to check the result.

use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

my $html_mode = 0;

if (@ARGV == 0) {
    print "Usage: $0 [-h] <input html file> > <output file>\n-h: Output HTML format, otherwise text format.\n";
    exit;
}

if ($ARGV[0] eq "-h") {
    $html_mode = 1;
    shift;
}

open(my $fh, "<:utf8", (shift || "index.htm")) || die "Can't open file: $!";
# The input is decoded as UTF-8, so encode the output the same way.
binmode(STDOUT, ":utf8");
my $p = HTML::TokeParser->new($fh);

my $head = <<EOF;
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
EOF

print $head if $html_mode;

# Each tweet starts at <div class="stream-item-header">; skip any other <div>.
while (my $token = $p->get_tag("div")) {
    my $class = $token->[1]{class} // "";
    next if $class ne "stream-item-header";

    # The first link in the header carries the tweet URL and its timestamp.
    $token = $p->get_tag("a");
    my $url  = $token->[1]{href}  // "";
    my $time = $token->[1]{title} // "";

    # Display name.
    $token = $p->get_tag("strong");
    $class = $token->[1]{class} // "";
    next if $class !~ /^fullname/;
    my $fullname = $p->get_text("/strong");
    print $fullname;

    # User name: the first <span> whose class starts with "username".
    while ($token = $p->get_tag("span")) {
        $class = $token->[1]{class} // "";
        if ($class =~ /^username/) {
            my $username = $p->get_text("/span");
            if ($html_mode) {
                print " $username <a href=\"$url\">$time</a><br />\n";
            } else {
                print " $username $time $url\n";
            }
            last;
        }
    }

    # Tweet body: copy text and links until the closing </p>.
    $token = $p->get_tag("p");
    $class = $token->[1]{"class"} // "";
    next if $class ne "js-tweet-text";
    while ($token = $p->get_token) {
        if ($token->[0] eq "E" && $token->[1] eq "p") {
            print $html_mode ? "<br />\n<br />\n" : "\n\n";
            last;
        }
        if ($token->[0] eq "T") {
            my $text = $token->[1];
            # Text tokens are raw HTML: keep the entities in HTML mode so that
            # characters such as "<" stay escaped, and decode them for text output.
            decode_entities($text) unless $html_mode;
            print $text;
        }
        if ($token->[0] eq "S" && $token->[1] eq "a"
            && ($token->[2]{class} // "") eq "twitter-timeline-link") {
            # Print the short link followed by its expanded forms, then skip the link text.
            my $link = $token->[2]{"href"} // "";
            my $link_expanded = $token->[2]{"data-expanded-url"} // "";
            my $link_ultimate = $token->[2]{"data-ultimate-url"};
            print "$link = $link_expanded", $link_ultimate ? " = $link_ultimate " : "";
            $p->get_tag("/a");
        }
    }
}

print "</body>\n</html>\n" if $html_mode;

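Here are two simple Perl scripts that clean up and simplify saved Twitter web pages. They keep only the message text and the key link information, so the result can easily be copied and pasted into an email or a blog editor for archiving and sharing.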

1. twitter_msg_cleanup.pl http://pastebin.com/J6QqxdBz
For web pages saved from the main Twitter site, twitter.com

2. mobile.twitter_msg_cleanup.pl http://pastebin.com/Xz9dfSiM
For web pages saved from the mobile site, mobile.twitter.com
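
Both scripts are built on HTML::TokeParser, and they read the class attribute from index 1 in some places and index 2 in others. That follows from the two reading methods: `get_tag` returns `[$tag, \%attr, ...]`, while `get_token` returns `["S", $tag, \%attr, ...]` for a start tag. The following is a minimal sketch on a made-up fragment; the markup and the example.com URL are illustrative assumptions, not real Twitter output.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical fragment, loosely shaped like one saved tweet body.
my $html = <<'EOF';
<div class="tweet-text">Tom &amp; Jerry <a class="twitter_external_link" href="http://example.com/x">example.com/x</a></div>
EOF

# Passing a scalar reference makes the parser read from the string itself.
my $p = HTML::TokeParser->new(\$html);

# get_tag returns [$tag, \%attr, ...]: the attributes are at index 1.
my $tag = $p->get_tag("div");
my $div_class = $tag->[1]{class};

my ($body, $link) = ("", "");
# get_token returns ["S", $tag, \%attr, ...], ["E", $tag, ...] or ["T", $text, ...].
while (my $token = $p->get_token) {
    my $type = $token->[0];
    last if $type eq "E" && $token->[1] eq "div";
    if ($type eq "T") {
        # Text tokens are raw, so entities still need decoding.
        my $text = $token->[1];
        decode_entities($text);
        $body .= $text;
    }
    elsif ($type eq "S" && $token->[1] eq "a") {
        # For start-tag tokens the attributes are at index 2.
        $link = $token->[2]{href};
    }
}

print "class=$div_class\nbody=$body\nlink=$link\n";
```

Unlike the mobile script, this sketch does not consume the link text with `get_text("/a")`, so the link text also ends up in `$body`.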

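Usage:

1. Install a Perl interpreter, e.g. ActivePerl http://www.activestate.com/activeperl
2. Save the Twitter page as "complete html" (Webpage, Complete), naming it e.g. 01.htm.

Note: Besides the htm file, a folder with the suffix _files should be created at the same time. The contents of this folder are not needed, but its presence confirms that the Twitter message text was saved in full.
This matters because Chrome does not save the complete Twitter message text when a page is saved in "HTML Only" (Webpage, HTML Only) mode.

3. Save the two scripts from this post in the same folder as the htm file saved in step 2.
4. Open a command window and change the current directory to the folder that holds these files.
5. Enter the command: twitter_msg_cleanup.pl -h 01.htm > 01o.htm
6. Open 01o.htm in a browser to view the result.
7. Without the "-h" option, the output is plain text.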
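
When several pages have been saved, steps 5 and 6 can be repeated for each file. The following is a minimal sketch of a hypothetical wrapper (not part of the original scripts) that runs one of the cleanup scripts over every .htm file in the current folder. It follows the 01.htm to 01o.htm naming used above and assumes the browser names the resource folder after the page, e.g. 01_files. The helper names `output_name`, `files_dir` and `run_all` are made up for illustration.

```perl
use strict;
use warnings;

# Map an input name such as "01.htm" to the output name "01o.htm".
sub output_name {
    my ($in) = @_;
    (my $out = $in) =~ s/(\.html?)$/o$1/i;
    return $out;
}

# Name of the resource folder that a "complete" save is assumed to create.
sub files_dir {
    my ($in) = @_;
    (my $dir = $in) =~ s/\.html?$/_files/i;
    return $dir;
}

# Run the given cleanup script in HTML mode on every .htm file here.
sub run_all {
    my ($script) = @_;
    my @pages = glob("*.htm");
    # Files that are the result of an earlier run must not be processed again.
    my %is_output = map { output_name($_) => 1 } @pages;
    for my $in (@pages) {
        next if $is_output{$in};
        warn "$in: no ", files_dir($in), " folder; the page may be incomplete\n"
            unless -d files_dir($in);
        my $out = output_name($in);
        system(qq{perl "$script" -h "$in" > "$out"}) == 0
            or warn "$script failed on $in\n";
    }
}

# Only do work when a script name is given on the command line.
run_all(shift @ARGV) if @ARGV;
```

Saved under a name such as batch_cleanup.pl (hypothetical), it could be started with: perl batch_cleanup.pl twitter_msg_cleanup.pl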