TREC KBA Stream Corpora 2012-2014

These corpora are stored as concatenated Thrift messages, described here: https://github.com/trec-kba/streamcorpus. Tools are available at streamcorpus.org.

To obtain the GPG key to decrypt this data, submit a data use agreement to NIST: http://trec.nist.gov/data/kba.html.

For any questions, join this discussion forum: https://groups.google.com/group/streamcorpus.

  1. The TREC 2014 StreamCorpus subsumes the first and second KBA corpora and extends them, both in duration and in rich metadata from BBN's Serif. The total size of the data after XZ compression and GPG encryption is 17258989590905 bytes, or 16.1TB. This is ~3x larger than the 2013 corpus, primarily because of the rich NLP tagging information added to the English and Unknown language documents.

    s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0/

    The full list of 2,677,758 file paths to the full 2014 corpus is available here. These paths must be prepended with

    http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0/
    See wget recipe and other tools described below.


  2. 579,838,246 of the 1.2B documents have been tagged by Serif. This is the official document set for TREC KBA 2014. This 10.9TB subset has been extracted and stored separately here:

    s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/

    The full list of 2,669,424 file paths to the serif-only subset of the 2014 corpus is available here. These paths must be prepended with

    http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2014-v0_3_0-serif-only/


  3. The kba-streamcorpus-2014-v0_3_0-kba-filtered corpus is a specially filtered subset of the full 2014 StreamCorpus for use in the KBA CCR and KBA SSF tasks in TREC 2014. It was filtered using surface form names and slot fill strings from the official query entities for KBA 2014.

    Stats: 20,494,260 StreamItems (text files) stored in 2,022,998 chunk files, totaling 639GB (xz-compressed)

    List of chunk files: kba-streamcorpus-2014-v0_3_0-kba-filtered.txt.xz

    Source           StreamItems
    arxiv                  9,075
    CLASSIFIED           100,669
    FORUM                480,581
    linking              189,445
    MAINSTREAM_NEWS    2,819,579
    MEMETRACKER              265
    news               4,772,483
    REVIEW                 3,011
    social             3,218,775
    WEBLOG             8,900,377


  4. The TREC-TS-2014F corpus is a specially filtered subset of the full 2014 StreamCorpus for use in the Temporal Summarization (TREC-TS) track.

    The proportions of different sources are essentially the same as in the 2013 corpus listed below.

    The most common languages in the corpus are listed below:
    Language      Percentage
    ENGLISH            41.36
    Unknown            16.04
    GERMAN             11.45
    Japanese            7.23
    DUTCH               4.27
    SPANISH             3.62
    RUSSIAN             1.98
    FRENCH              1.89
    PORTUGUESE          1.88
    Chinese             1.64
    ITALIAN             1.24
    ARABIC              0.73
    INDONESIAN          0.68
    VIETNAMESE          0.67
    GREEK               0.60
    SWEDISH             0.51
    TURKISH             0.45
    MALAY               0.42
    POLISH              0.42
    PERSIAN             0.35
    ChineseT            0.34

  5. The second TREC KBA streamcorpus subsumes the first corpus and adds to it. The total size of the data after XZ compression and GPG encryption is 7096486977581 bytes, or 6.45TB.

    s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/

    The second TREC KBA streamcorpus is also available with all non-English documents removed and the StreamItem.body.raw text set to "". This stripped corpus is about 4.5TB and contains just over 500M StreamItems.

    s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/

      StreamItems  chunk files  Substream
          126,952       11,851  arxiv (full text, abstracts in StreamItem.other_content)
      394,381,405      688,974  social (reprocessed from kba-stream-corpus-2012 plus extension, same stream_id)
      134,933,117      280,658  news (reprocessed from kba-stream-corpus-2012, same stream_id)
        5,448,875       12,946  linking (reprocessed from kba-stream-corpus-2012, same stream_id)
      396,863,627      927,257  WEBLOG (spinn3r)
       57,391,714      164,160  MAINSTREAM_NEWS (spinn3r)
       36,559,578       85,769  FORUM (spinn3r)
       14,755,278       36,272  CLASSIFIED (spinn3r)
           52,412        9,499  REVIEW (spinn3r)
            7,637        5,168  MEMETRACKER (spinn3r)
    1,040,520,595    2,222,554  Total

  6. The first TREC KBA stream corpus, known as kba-stream-corpus-2012, is available here:

    s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/
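
The substream table for the 2013 corpus (item 5) ends with a Total row, which allows a quick consistency check when the numbers are transcribed. The Python sketch below sums both columns and compares them with the stated totals. The names `SUBSTREAMS`, `STATED_TOTAL` and `column_totals` are illustrative only; the figures are copied from the table as given.

```python
from typing import Dict, Tuple

# (StreamItems, chunk files) per substream, copied from the 2013 table.
SUBSTREAMS: Dict[str, Tuple[int, int]] = {
    "arxiv": (126_952, 11_851),
    "social": (394_381_405, 688_974),
    "news": (134_933_117, 280_658),
    "linking": (5_448_875, 12_946),
    "WEBLOG": (396_863_627, 927_257),
    "MAINSTREAM_NEWS": (57_391_714, 164_160),
    "FORUM": (36_559_578, 85_769),
    "CLASSIFIED": (14_755_278, 36_272),
    "REVIEW": (52_412, 9_499),
    "MEMETRACKER": (7_637, 5_168),
}

# Values from the table's Total row.
STATED_TOTAL: Tuple[int, int] = (1_040_520_595, 2_222_554)


def column_totals(rows: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Return (sum of StreamItems, sum of chunk files) over all rows."""
    stream_items = sum(items for items, _ in rows.values())
    chunk_files = sum(chunks for _, chunks in rows.values())
    return stream_items, chunk_files


if __name__ == "__main__":
    totals = column_totals(SUBSTREAMS)
    print("computed:", totals)
    print("matches Total row:", totals == STATED_TOTAL)
```

The same pattern applies to the per-source counts of the kba-filtered corpus, whose sum can be compared with the StreamItems figure in its Stats line.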

The Wikipedia dump from 2012-01-04 is hosted here: enwiki-20120104-pages-articles.xml.xz
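
The two 2014 path lists hold relative paths, so each entry has to be joined with the matching prefix before it can be fetched. A minimal Python sketch of that step follows. The function name `build_urls` and the sample path are hypothetical placeholders, not part of the corpus tooling, and the assumed list format (one relative path per line) should be checked against the actual list files.

```python
from typing import Iterable, Iterator

# Prefixes copied from the corpus listing above.
FULL_2014_PREFIX = (
    "http://s3.amazonaws.com/aws-publicdatasets/trec/kba/"
    "kba-streamcorpus-2014-v0_3_0/"
)
SERIF_ONLY_2014_PREFIX = (
    "http://s3.amazonaws.com/aws-publicdatasets/trec/kba/"
    "kba-streamcorpus-2014-v0_3_0-serif-only/"
)


def build_urls(prefix: str, relative_paths: Iterable[str]) -> Iterator[str]:
    """Yield one download URL per non-empty entry of a path list.

    Assumption: the list contains one relative path per line.
    """
    base = prefix.rstrip("/")
    for line in relative_paths:
        path = line.strip()
        if not path:
            continue  # skip blank lines
        # Join with exactly one slash between prefix and path.
        yield base + "/" + path.lstrip("/")


if __name__ == "__main__":
    # Hypothetical entry; real entries come from the published path list.
    sample_lines = ["HYPOTHETICAL_DIR/HYPOTHETICAL_CHUNK_FILE\n"]
    for url in build_urls(FULL_2014_PREFIX, sample_lines):
        print(url)
```

Because `build_urls` is a generator, it can be fed an open file handle for the decompressed list without loading all entries into memory, and the resulting URLs can be written to a file for a bulk downloader.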

Download

There are many ways to download the corpus, including: