These corpora are stored in concatenated Thrift messages described here: https://github.com/trec-kba/streamcorpus, and there are tools available at streamcorpus.org
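For readers new to the format, here is a minimal sketch of iterating over one decrypted chunk file, assuming the streamcorpus Python package (`pip install streamcorpus`); the filename is illustrative, and per the project docs the Chunk reader decompresses xz data itself:

```python
# Minimal sketch: read one decrypted, xz-compressed chunk file and print a few
# fields from each Thrift StreamItem. Assumes the streamcorpus Python package.
from streamcorpus import Chunk

for si in Chunk(path='chunk.sc.xz', mode='rb'):
    # each StreamItem carries one document plus its metadata
    print(si.stream_id, si.source, si.stream_time.zulu_timestamp)
```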
To obtain the GPG key to decrypt this data, submit a data use agreement to NIST: http://trec.nist.gov/data/kba.html.
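As a sketch of the decryption step, assuming gpg is installed and the key from NIST has already been imported into the local keyring (filenames are illustrative):

```python
# Decrypt one chunk file by shelling out to gpg; assumes the NIST-issued key
# was already imported with `gpg --import`. Filenames are illustrative.
import subprocess

encrypted = 'chunk.sc.xz.gpg'
decrypted = encrypted[:-len('.gpg')]          # -> chunk.sc.xz
with open(decrypted, 'wb') as out:
    subprocess.check_call(['gpg', '--decrypt', encrypted], stdout=out)
```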
For any questions, join this discussion forum: https://groups.google.com/group/streamcorpus.
The TREC 2014 StreamCorpus subsumes the first and second KBA corpora and extends them, both in duration and in rich metadata from BBN's Serif. The total size of the data after XZ compression and GPG encryption is 17,258,989,590,905 bytes, or about 15.7TB, roughly 2.4x larger than the 2013 corpus, primarily because of the rich NLP tagging information added to the English and Unknown-language documents.
s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0/
The full list of 2,677,758 file paths to the full 2014 corpus is available here. These paths must be prepended with
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0/
See the wget recipe and other tools described below.
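For example, a minimal Python sketch that joins each relative path onto that base URL and downloads it with the standard library; `paths.txt` stands in for the linked path list:

```python
# Sketch: prepend the base URL to each relative path from the path list and
# fetch it, mirroring the corpus directory layout locally.
import os
import urllib.request

BASE = 'http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0/'

with open('paths.txt') as f:          # stand-in for the downloaded path list
    for line in f:
        path = line.strip()
        if not path:
            continue
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        urllib.request.urlretrieve(BASE + path, path)
```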
579,838,246 of the 1.2B documents have been tagged by Serif. This is the official document set for TREC KBA 2014. This 10.9TB subset has been extracted and stored separately here:
s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/
The full list of 2,669,424 file paths to the serif-only subset of the 2014 corpus is available here. These paths must be prepended with
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/
The kba-streamcorpus-2014-v0_3_0-kba-filtered corpus is a specially filtered subset of the full 2014 StreamCorpus for use in the KBA CCR and KBA SSF tasks in TREC 2014. It was filtered using surface-form names and slot-fill strings from the official query entities for KBA 2014; a toy sketch of this style of filtering appears after the substream table below.
Stats: 20,494,260 StreamItems (text files) stored in 2,022,998 chunk files, totaling 639GB (xz-compressed)
List of chunk files: kba-streamcorpus-2014-v0_3_0-kba-filtered.txt.xz
Counts of StreamItems by substream:
Substream | StreamItems |
---|---|
arxiv | 9075 |
CLASSIFIED | 100669 |
FORUM | 480581 |
linking | 189445 |
MAINSTREAM_NEWS | 2819579 |
MEMETRACKER | 265 |
news | 4772483 |
REVIEW | 3011 |
social | 3218775 |
WEBLOG | 8900377 |
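For illustration only, and not the actual KBA filter: a toy sketch of this style of surface-form matching over one chunk file, assuming the streamcorpus Python package; the surface forms and filename below are hypothetical:

```python
# Toy illustration of surface-form filtering -- NOT the actual KBA filter.
# Keeps StreamItems whose visible text mentions any query-entity surface form.
from streamcorpus import Chunk

surface_forms = ['Boris Berezovsky', 'Basic Element']  # hypothetical examples

matches = []
for si in Chunk(path='chunk.sc.xz', mode='rb'):
    # body or clean_visible may be absent on some items
    text = (si.body.clean_visible or '') if si.body else ''
    if any(name in text for name in surface_forms):
        matches.append(si.stream_id)
print(len(matches), 'matching StreamItems')
```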
The TREC-TS-2014F corpus is a specially filtered subset of the full 2014 StreamCorpus for use in the Temporal Summarization (TREC-TS) track.
The proportions of the different sources are essentially the same as in the 2013 corpus listed below.
The most common languages in the corpus are listed below:
Language | Percentage |
---|---|
ENGLISH | 41.36 |
Unknown | 16.04 |
GERMAN | 11.45 |
Japanese | 7.23 |
DUTCH | 4.27 |
SPANISH | 3.62 |
RUSSIAN | 1.98 |
FRENCH | 1.89 |
PORTUGUESE | 1.88 |
Chinese | 1.64 |
ITALIAN | 1.24 |
ARABIC | 0.73 |
INDONESIAN | 0.68 |
VIETNAMESE | 0.67 |
GREEK | 0.60 |
SWEDISH | 0.51 |
TURKISH | 0.45 |
MALAY | 0.42 |
POLISH | 0.42 |
PERSIAN | 0.35 |
ChineseT | 0.34 |
The second TREC KBA streamcorpus subsumes the first corpus and extends it. The total size of the data after XZ compression and GPG encryption is 7,096,486,977,581 bytes, or 6.45TB.
s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/
The second TREC KBA streamcorpus is also available with all non-English documents removed and the StreamItem.body.raw text set to "". This stripped corpus is about 4.5TB and contains just over 500M StreamItems.
s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/
StreamItems | chunk files | Substream |
---|---|---|
126,952 | 11,851 | arxiv (full text, abstracts in StreamItem.other_content) |
394,381,405 | 688,974 | social (reprocessed from kba-stream-corpus-2012 plus extension, same stream_id) |
134,933,117 | 280,658 | news (reprocessed from kba-stream-corpus-2012, same stream_id) |
5,448,875 | 12,946 | linking (reprocessed from kba-stream-corpus-2012, same stream_id) |
396,863,627 | 927,257 | WEBLOG (spinn3r) |
57,391,714 | 164,160 | MAINSTREAM_NEWS (spinn3r) |
36,559,578 | 85,769 | FORUM (spinn3r) |
14,755,278 | 36,272 | CLASSIFIED (spinn3r) |
52,412 | 9,499 | REVIEW (spinn3r) |
7,637 | 5,168 | MEMETRACKER (spinn3r) |
1,040,520,595 | 2,222,554 | Total |
The Wikipedia dump from 2012-01-04 is hosted here: enwiki-20120104-pages-articles.xml.xz
There are many ways to download the corpus, including:
Useful tools for interacting directly with S3 include s3cmd (http://s3tools.org/s3cmd) and Boto.
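For example, a minimal sketch using Boto's classic S3 interface to fetch the dir-names.txt listing (used in the wget recipe below) anonymously from the public bucket:

```python
# Minimal sketch: anonymous download of one file from the public S3 bucket
# with boto's classic (boto 2.x) S3 interface.
import boto

conn = boto.connect_s3(anon=True)      # anonymous works: the bucket is public
bucket = conn.get_bucket('aws-publicdatasets', validate=False)
key = bucket.get_key('trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/dir-names.txt')
key.get_contents_to_filename('dir-names.txt')
```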
The streamcorpus-pipeline tools use boto and have tutorials for processing the entire corpus in AWS for under $500.
For an example that does not use streamcorpus-pipeline's stage infrastructure, see this simplified example: https://github.com/trec-kba/streamcorpus-pipeline/blob/master/examples/verify_kba2014.py
The following download recipe uses GNU Parallel:
```bash
## Fetch the list of directory names -- date-hour strings
wget http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/dir-names.txt

## Use GNU parallel to make multiple wget requests in parallel.
## The --continue flag makes this restartable.
cat dir-names.txt | parallel -j 10 --eta 'wget --recursive --continue --no-host-directories --no-parent --reject "index.html*" http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/{}/index.html'
```