These corpora are stored in concatenated Thrift messages described here: https://github.com/trec-kba/streamcorpus, and there are tools available at streamcorpus.org
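For readers new to the format, here is a minimal sketch of iterating over one decrypted chunk file, assuming the streamcorpus Python package (`pip install streamcorpus`); the filename is illustrative, and per the project docs the Chunk reader decompresses xz data itself:

```python
# Minimal sketch: read one decrypted, xz-compressed chunk file and print a few
# fields from each Thrift StreamItem. Assumes the streamcorpus Python package.
from streamcorpus import Chunk

for si in Chunk(path='chunk.sc.xz', mode='rb'):
    # each StreamItem carries one document plus its metadata
    print(si.stream_id, si.source, si.stream_time.zulu_timestamp)
```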
To obtain the GPG key to decrypt this data, submit a data use agreement to NIST: http://trec.nist.gov/data/kba.html.
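As a sketch of the decryption step, assuming gpg is installed and the key from NIST has already been imported into the local keyring (filenames are illustrative):

```python
# Decrypt one chunk file by shelling out to gpg; assumes the NIST-issued key
# was already imported with `gpg --import`. Filenames are illustrative.
import subprocess

encrypted = 'chunk.sc.xz.gpg'
decrypted = encrypted[:-len('.gpg')]          # -> chunk.sc.xz
with open(decrypted, 'wb') as out:
    subprocess.check_call(['gpg', '--decrypt', encrypted], stdout=out)
```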
For any questions, join this discussion forum: https://groups.google.com/group/streamcorpus.
The TREC 2014 StreamCorpus subsumes the first and second KBA corpora and extends them, both in duration and in rich metadata from BBN's Serif. The total size of the data after XZ compression and GPG encryption is 17,258,989,590,905 bytes, or about 15.7TB, roughly 2.4x larger than the 2013 corpus, primarily because of the rich NLP tagging information added to the English and Unknown-language documents.
s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0/
The full list of 2,677,758 file paths to the full 2014 corpus is available here. These paths must be prepended with
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0/
See the wget recipe and other tools described below.
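For example, a minimal Python sketch that joins each relative path onto that base URL and downloads it with the standard library; `paths.txt` stands in for the linked path list:

```python
# Sketch: prepend the base URL to each relative path from the path list and
# fetch it, mirroring the corpus directory layout locally.
import os
import urllib.request

BASE = 'http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0/'

with open('paths.txt') as f:          # stand-in for the downloaded path list
    for line in f:
        path = line.strip()
        if not path:
            continue
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        urllib.request.urlretrieve(BASE + path, path)
```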
579,838,246 of the 1.2B documents have been tagged by Serif. This is the official document set for TREC KBA 2014. This 10.9TB subset has been extracted and stored separately here:
s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/
The full list of 2,669,424 file paths to the serif-only subset of the 2014 corpus is available here. These paths must be prepended with
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/
The kba-streamcorpus-2014-v0_3_0-kba-filtered corpus is a specially filtered subset of the full 2014 StreamCorpus for use in the KBA CCR and KBA SSF tasks in TREC 2014. It was filtered using surface-form names and slot-fill strings from the official query entities for KBA 2014; a toy sketch of this style of filtering appears after the substream table below.
Stats: 20,494,260 StreamItems (text files) stored in 2,022,998 chunk files, totaling 639GB (xz-compressed)
List of chunk files: kba-streamcorpus-2014-v0_3_0-kba-filtered.txt.xz
Counts of StreamItems by substream:
Substream | StreamItems |
---|---|
arxiv | 9075 |
CLASSIFIED | 100669 |
FORUM | 480581 |
linking | 189445 |
MAINSTREAM_NEWS | 2819579 |
MEMETRACKER | 265 |
news | 4772483 |
REVIEW | 3011 |
social | 3218775 |
WEBLOG | 8900377 |
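For illustration only, and not the actual KBA filter: a toy sketch of this style of surface-form matching over one chunk file, assuming the streamcorpus Python package; the surface forms and filename below are hypothetical:

```python
# Toy illustration of surface-form filtering -- NOT the actual KBA filter.
# Keeps StreamItems whose visible text mentions any query-entity surface form.
from streamcorpus import Chunk

surface_forms = ['Boris Berezovsky', 'Basic Element']  # hypothetical examples

matches = []
for si in Chunk(path='chunk.sc.xz', mode='rb'):
    # body or clean_visible may be absent on some items
    text = (si.body.clean_visible or '') if si.body else ''
    if any(name in text for name in surface_forms):
        matches.append(si.stream_id)
print(len(matches), 'matching StreamItems')
```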
The TREC-TS-2014F corpus is a specially filtered subset of the full 2014 StreamCorpus for use in the Temporal Summarization (TREC-TS) track.
The proportions of the different sources are essentially the same as in the 2013 corpus listed below.
The most common languages in the corpus are listed below:
Language | Percentage |
---|---|
ENGLISH | 41.36 |
Unknown | 16.04 |
GERMAN | 11.45 |
Japanese | 7.23 |
DUTCH | 4.27 |
SPANISH | 3.62 |
RUSSIAN | 1.98 |
FRENCH | 1.89 |
PORTUGUESE | 1.88 |
Chinese | 1.64 |
ITALIAN | 1.24 |
ARABIC | 0.73 |
INDONESIAN | 0.68 |
VIETNAMESE | 0.67 |
GREEK | 0.60 |
SWEDISH | 0.51 |
TURKISH | 0.45 |
MALAY | 0.42 |
POLISH | 0.42 |
PERSIAN | 0.35 |
ChineseT | 0.34 |
The second TREC KBA streamcorpus subsumes the first corpus and extends it. The total size of the data after XZ compression and GPG encryption is 7,096,486,977,581 bytes, or 6.45TB.
s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/
The second TREC KBA streamcorpus is also available with all non-English documents removed and the StreamItem.body.raw text set to "". This stripped corpus is about 4.5TB and contains just over 500M StreamItems.
s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/
StreamItems | chunk files | Substream |
---|---|---|
126,952 | 11,851 | arxiv (full text, abstracts in StreamItem.other_content) |
394,381,405 | 688,974 | social (reprocessed from kba-stream-corpus-2012 plus extension, same stream_id) |
134,933,117 | 280,658 | news (reprocessed from kba-stream-corpus-2012, same stream_id) |
5,448,875 | 12,946 | linking (reprocessed from kba-stream-corpus-2012, same stream_id) |
396,863,627 | 927,257 | WEBLOG (spinn3r) |
57,391,714 | 164,160 | MAINSTREAM_NEWS (spinn3r) |
36,559,578 | 85,769 | FORUM (spinn3r) |
14,755,278 | 36,272 | CLASSIFIED (spinn3r) |
52,412 | 9,499 | REVIEW (spinn3r) |
7,637 | 5,168 | MEMETRACKER (spinn3r) |
1,040,520,595 | 2,222,554 | Total |
The Wikipedia dump from 2012-01-04 is hosted here: enwiki-20120104-pages-articles.xml.xz
There are many ways to download the corpus, including:
Useful tools for interacting directly with S3 include s3cmd (http://s3tools.org/s3cmd) and Boto.
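For example, a minimal sketch using Boto's classic S3 interface to fetch the dir-names.txt listing (used in the wget recipe below) anonymously from the public bucket:

```python
# Minimal sketch: anonymous download of one file from the public S3 bucket
# with boto's classic (boto 2.x) S3 interface.
import boto

conn = boto.connect_s3(anon=True)      # anonymous works: the bucket is public
bucket = conn.get_bucket('aws-publicdatasets', validate=False)
key = bucket.get_key('trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/dir-names.txt')
key.get_contents_to_filename('dir-names.txt')
```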
The streamcorpus-pipeline tools use boto and have tutorials for processing the entire corpus in AWS for under $500.
For an example that does not use streamcorpus-pipeline's stage infrastructure, see this simplified example: https://github.com/trec-kba/streamcorpus-pipeline/blob/master/examples/verify_kba2014.py
The following download recipe uses GNU Parallel:
```bash
## Fetch the list of directory names -- date-hour strings
wget http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/dir-names.txt

## Use GNU parallel to make multiple wget requests in parallel.
## The --continue flag makes this restartable.
cat dir-names.txt | parallel -j 10 --eta 'wget --recursive --continue --no-host-directories --no-parent --reject "index.html*" http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/{}/index.html'
```