The Israeli Association of Human Language Technologies

IAHLT Linguistic Datasets

IAHLT specializes in compiling linguistic resources for natural language processing. We collect large corpora in Hebrew and Arabic. The data is annotated by highly skilled linguists for morphology, syntax, and semantics. We adhere to high standards of quality control. A subset of each dataset is annotated by two different annotators and adjudicated by a third person. Our data is released with transparent scores of inter-annotator agreement along with the data on which these scores were calculated.
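
To illustrate the kind of agreement score mentioned above, here is a minimal sketch of Cohen's kappa for two annotators. It assumes the two-annotator Cohen variant; the release notes say only "Kappa", so the exact statistic IAHLT uses may differ. The function name and the sample tags are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical part-of-speech tags from two annotators for six tokens.
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Computing the score separately for each annotated feature (for example, once for part-of-speech and once for dependency label) would give per-feature figures of the sort listed in the releases below.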

Datasets

  1. Release 2024-03-07

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

    The Hebrew treebank consists of nearly 70k dependency trees, including hundreds of complete documents, from Bagatz decisions, Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

    Two independent annotations for nearly 15k of the 39k total sentences. Kappa inter-annotator agreement score per-feature.

    Our Palestinian Arabic text corpus consists of nearly 175k tokens of transcribed text from 78 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

    Our Modern Standard Arabic lemmatisation and part-of-speech tagging corpus consists of 9697 annotations of 8481 sentences (with a total of 276k tokens annotated, 16896 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation and the All Rights entitlements organisation.

    Our Palestinian Arabic lemmatisation and part-of-speech tagging corpus consists of 2807 annotated sentences (with a total of 39344 tokens annotated, 4304 unique lemmas) for lemma and part-of-speech. The texts were sampled from the YouTube video transcripts.

    Our text summarization preview corpus consists of summaries of 511 articles, 63 of which have been independently summarized by two annotators. An inter-annotator agreement analysis is provided.

    Our coreference resolution corpus consists of 554 articles in Hebrew from the Davar and Israel Hayom news organizations, Knesset protocols, and Hebrew Wikipedia, and 104 articles in Arabic from the Kul al-Arab news organization and the All Rights entitlement advocacy organization, annotated with morpheme-level spans for coreference chains.

    Our Wikification corpus consists of 19 Arabic articles and 64 Hebrew articles from our named entities corpus, with 934 and 4059 resolutions (respectively) of entities to Wikidata concept entries.

  2. Release 2023-09-06

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

    The Hebrew treebank consists of over 63k dependency trees, including hundreds of complete documents, from Bagatz decisions, Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

    Source | Genres | no. Sentences | no. Documents
    Bagatz | Supreme Court decisions | 173 | 7
    Davar | Hebrew news | 13975 | 375
    GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4923 | 0
    Israel Hayom | Hebrew news | 1711 | 30
    Knesset protocols | Protocols of the Knesset; each "document" is a separate protocol | 4819 | 897
    Kol Zchut | Information about entitlements rights of Israelis | 13079 | 313
    Wikipedia | Biographies, events, locations, legal, medical | 24023 | 77

    Two independent annotations for nearly 15k of the 39k total sentences. Kappa inter-annotator agreement score per-feature.

    Our Palestinian Arabic text corpus consists of 148k tokens of transcribed text from 45 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

    Year | Videos | Duration (min) | no. Tokens
    2010-2014 | 4 | 18 | 2093
    2015-2019 | 19 | 484 | 37469
    2020- | 19 | 942 | 62975

    Our Modern Standard Arabic lemmatisation and part-of-speech tagging corpus consists of 5973 annotations of 4869 sentences (with a total of 171722 tokens annotated, 12190 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation and the All Rights entitlements organisation.

    Our Palestinian Arabic lemmatisation and part-of-speech tagging corpus consists of 160 annotated sentences (with a total of 1830 tokens annotated, 485 unique lemmas) for lemma and part-of-speech. The texts were sampled from the YouTube video transcripts.

    Our text summarization preview corpus consists of summaries of 89 Israel Hayom articles, 32 of which have been independently summarized by two annotators. An inter-annotator agreement analysis is provided.

    Our coreference resolution preview corpus consists of 64 paragraphs from 4 articles in Hebrew and 25 paragraphs from 3 articles in Arabic, annotated with morpheme-level spans for coreference chains.

    Our Wikification preview corpus consists of 10 paragraphs from our named entities corpus, with 215 resolutions of entities to Wikidata concept entries.

  3. Release 2023-06-06

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

    The Hebrew treebank consists of over 48k dependency trees, including hundreds of complete documents, from Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

    Source | Genres | no. Sentences | no. Documents
    Davar | Hebrew news | 11224 | 369
    GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4213 | 0
    Israel Hayom | Hebrew news | 386 | 7
    Knesset protocols | Protocols of the Knesset; each "document" is a separate protocol | 2883 | 100
    Kol Zchut | Information about entitlements rights of Israelis | 11214 | 313
    Wikipedia | Biographies, events, locations, legal, medical | 11309 | 78

    Two independent annotations for over 6k of the 39k total sentences. Kappa inter-annotator agreement score per-feature.

    Our named entities corpus includes over 123k paragraphs, over 11k documents in Arabic and Hebrew, annotated for 13 entity types.

    Language | Source | no. Paragraphs | no. Articles
    Hebrew | Davar, Israel Hayom, Knesset protocols, and Wikipedia | 47k | 2923
    Arabic | Davar, Kul al-Arab, and Wikipedia | 75k | 8342

    Our Palestinian Arabic corpus consists of 93k tokens of transcribed text from 34 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

    Year | Videos | Duration (min) | no. Tokens
    2010-2014 | 4 | 18 | 2131
    2015-2019 | 19 | 484 | 37026
    2020- | 11 | 686 | 47129

    Our Arabic lemmatisation and part-of-speech tagging corpus consists of 5973 annotations of 4869 sentences (with a total of 171722 tokens annotated, 12190 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation and YouTube video transcripts. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

  4. Release 2023-05-29

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

    The Hebrew treebank consists of over 48k dependency trees, including hundreds of complete documents, from Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

    Source | Genres | no. Sentences | no. Documents
    Davar | Hebrew news | 11224 | 369
    GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4213 | 0
    Israel Hayom | Hebrew news | 386 | 7
    Knesset protocols | Protocols of the Knesset; each "document" is a separate protocol | 2883 | 100
    Kol Zchut | Information about entitlements rights of Israelis | 11214 | 313
    Wikipedia | Biographies, events, locations, legal, medical | 11309 | 78

    Two independent annotations for over 6k of the 39k total sentences. Kappa inter-annotator agreement score per-feature.

    Our named entities corpus includes over 123k paragraphs, over 11k documents in Arabic and Hebrew, annotated for 13 entity types.

    Language | Source | no. Paragraphs | no. Articles
    Hebrew | Davar, Israel Hayom, Knesset protocols, and Wikipedia | 47k | 2923
    Arabic | Davar, Kul al-Arab, and Wikipedia | 75k | 8342

    Our Palestinian Arabic corpus consists of 90k tokens of transcribed text from 32 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

    Year | Videos | Duration (min) | no. Tokens
    2010-2014 | 4 | 18 | 2131
    2015-2019 | 19 | 484 | 37026
    2020- | 9 | 503 | 44133

    Our Arabic lemmatisation and part-of-speech tagging corpus consists of 5747 annotations of 4646 sentences (with a total of 165723 tokens annotated, 11742 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

  5. Release 2023-03-16

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

    The Hebrew treebank consists of over 45k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

    Source | Genres | no. Sentences | no. Documents
    Davar | Hebrew news | 11224 | 369
    GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4213 | 0
    Israel Hayom | Hebrew news | 94 | 3
    Kol Zchut | Information about entitlements rights of Israelis | 11214 | 313
    Wikipedia | Biographies, events, locations, legal, medical | 11309 | 75

    Two independent annotations for over 6k of the 36k total sentences. Kappa inter-annotator agreement score per-feature.

    Our named entities corpus includes over 122k paragraphs, over 11k documents in Arabic and Hebrew, annotated for 13 entity types.

    Language | Source | no. Paragraphs | no. Articles
    Hebrew | Davar, Wikipedia and Israel Hayom | 47k | 2829
    Arabic | Davar, Kul al-Arab, and Wikipedia | 75k | 8315

    Our Palestinian Arabic corpus consists of 90k tokens of transcribed text from 32 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

    Year | Videos | Duration (min) | no. Tokens
    2010-2014 | 4 | 18 | 2131
    2015-2019 | 19 | 484 | 37026
    2020- | 9 | 503 | 44133

    Our Arabic lemmatisation and part-of-speech tagging corpus consists of 4521 annotations of 3521 sentences (with a total of 130037 tokens annotated, 10135 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

  6. Release 2022-12-01

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and a preview sample of our Arabic corpus annotated for lemma and part-of-speech.

    The Hebrew treebank consists of over 36k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.

    Source | Genres | no. Sentences | no. Complete Documents
    Davar | Hebrew news | 10432 | 344
    GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2685 | 0
    Kol Zchut | Information about entitlements rights of Israelis | 10314 | 290
    Wikipedia | Biographies, events, locations, legal, medical | 9749 | 74

    Two independent annotations for over 2.5k of the 33k total sentences. Kappa inter-annotator agreement score per-feature.

    Our named entities corpus includes over 105k paragraphs, nearly 10k documents in Arabic and Hebrew, annotated for 12 entity types.

    Language | Source | no. Paragraphs | no. Articles
    Hebrew | Davar, Wikipedia and Israel Hayom | 43k | 2652
    Arabic | Davar and Kul al-Arab | 65k | 6960

    Our Palestinian Arabic corpus consists of 68k tokens of transcribed text from 27 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

    Year | Videos | Duration (min) | no. Tokens
    2011-2015 | 5 | 69 | 9054
    2016-2020 | 17 | 238 | 26354
    2021- | 4 | 292 | 25841

    Our Arabic lemmatisation and part-of-speech tagging corpus consists of 2198 annotations of 2100 sentences (with a total of 68260 tokens annotated, 7187 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

  7. Release 2022-09-06

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and a preview sample of our Arabic corpus annotated for lemma and part-of-speech.

    The Hebrew treebank consists of over 32k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.

    Source | Genres | no. Sentences | no. Complete Documents
    Davar | Hebrew news | 7563 | 253
    GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2685 | 0
    Kol Zchut | Information about entitlements rights of Israelis | 10306 | 289
    Wikipedia | Biographies, events, locations, legal, medical | 11278 | 74

    Two independent annotations for over 1.7k of the 30k total sentences. Kappa inter-annotator agreement score per-feature.

    Our named entities corpus includes over 85k paragraphs, nearly 7.5k documents in Arabic and Hebrew, annotated for 12 entity types.

    Language | Source | no. Paragraphs | no. Articles
    Hebrew | Davar, Wikipedia and Israel Hayom | 32k | 2046
    Arabic | Davar and Kul al-Arab | 53k | 5467

    Our Palestinian Arabic corpus consists of 35k tokens of transcribed text from 18 YouTube videos. Eight of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

    Year | Videos | Duration (min) | no. Tokens
    2011-2015 | 9 | 138 | 3037
    2016-2020 | 19 | 171 | 15050
    2021- | 5 | 206 | 15182

    Also included is a preview release of our Arabic lemmatisation and part-of-speech tagging corpus. The dataset consists of 388 annotations of 292 sentences (with a total of 11058 tokens annotated, 2009 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

  8. Release 2022-06-09

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, and a preview sample of our Palestinian Arabic text corpus.

    The Hebrew treebank consists of over 24k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.

    Source | Genres | no. Sentences | no. Complete Documents
    Davar | Hebrew news | 386 | 16
    GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2662 | 0
    Kol Zchut | Information about entitlements rights of Israelis | 9801 | 285
    Wikipedia | Biographies, events, locations, legal, medical | 11639 | 75

    Two independent annotations for over 1k of the 23k total sentences. Kappa inter-annotator agreement score per-feature.

    Our named entities corpus includes over 70k sentences, over 2k documents in Arabic and Hebrew, annotated for 12 entity types.

    Language | Source | no. Sentences | no. Articles
    Hebrew | Davar | 43k | 1200
    Arabic | Davar and Kul al-Arab | 30k | 971

    Our preview release of Palestinian Arabic consists of 12k tokens of transcribed text from 10 YouTube videos. Seven of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

  9. Release 2022-04-12

    The data consists of nearly 20k sentences, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, and the technology magazine Geektime.

    Source | Genres | no. Sentences | no. Complete Documents
    GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2662 | 0
    Kol Zchut | Information about entitlements rights of Israelis | 5576 | 152
    Wikipedia | Biographies, events, locations, legal, medical | 11229 | 75

    Two independent annotations for over 1k of the 18.5k total sentences. Kappa inter-annotator agreement score per-feature.

    Also included are preview samples of named entity annotation, along with the annotation guidelines. Excerpts from Hebrew and Arabic sources were annotated with minimal entity spans for 13 entity types.

    Language | Source | Sample Size
    Hebrew | Wikipedia | 39 paragraphs
    Arabic | Kol Zchut Arabic | 50 sentences

  10. Hebrew-Arabic Parallel Corpus 2021-11-21

    The dataset consists of over 4k parallel articles from the civil-legal domain. High-quality human translation from Hebrew to Arabic.

    Alignment on the sentence level. The metadata includes two scores:

    1. The confidence of the model in the alignment.
    2. The "uniqueness" of the vocabulary of each article compared to the rest of the articles.

    Source | Genre | no. Articles | no. Sentences
    All Rights | Government and legal data about rights, covering a wide range of topics: health, old age, religion, immigration, election | 4,400 parallel articles | 200,000 parallel sentences

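
One possible use of the alignment confidence score described above is to keep only the sentence pairs that clear a chosen threshold. The sketch below assumes the score is a number where higher means a more reliable alignment. The record layout, field names, function name, and sample values are all hypothetical and do not reflect the actual file format of the corpus.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AlignedPair:
    """One sentence-level alignment; the field names are hypothetical."""
    hebrew: str
    arabic: str
    confidence: float  # assumed: higher values mean a more reliable alignment


def filter_by_confidence(pairs: Iterable[AlignedPair], threshold: float) -> List[AlignedPair]:
    """Keep only the pairs whose alignment confidence meets the threshold."""
    return [pair for pair in pairs if pair.confidence >= threshold]


# Made-up placeholder records standing in for real corpus rows.
sample = [
    AlignedPair("hebrew sentence 1", "arabic sentence 1", 0.97),
    AlignedPair("hebrew sentence 2", "arabic sentence 2", 0.41),
    AlignedPair("hebrew sentence 3", "arabic sentence 3", 0.88),
]
kept = filter_by_confidence(sample, threshold=0.8)
print(len(kept))
```

The threshold of 0.8 is arbitrary. A suitable value would depend on how the confidence scores are distributed in the released metadata.
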
  11. Hebrew Treebank 2021-11-21

    The data consists of dozens of articles from Wikipedia and from the technology magazine Geektime.

    Source | Genres | no. Sentences
    GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2580
    Wikipedia | Biographies, events, locations, legal, medical | 6398

    Two independent annotations for over 1.2k of the 9k total sentences. Adjudication by a third annotator. Kappa inter-annotator agreement score per-feature.
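
As a rough illustration of the double-annotation workflow described above, the following sketch compares two annotators' dependency analyses of one sentence and lists the tokens on which they differ. These are the tokens a third annotator would need to adjudicate. The in-memory representation, the function name, and the sample sentence are assumptions made for the example, not the format of the released treebank.

```python
from typing import Dict, List, Tuple

# Each annotation maps a token index to a (head index, dependency label) pair.
Annotation = Dict[int, Tuple[int, str]]


def find_disagreements(first: Annotation, second: Annotation) -> List[int]:
    """Return the sorted token indices where the two annotations differ."""
    tokens = set(first) | set(second)
    return sorted(t for t in tokens if first.get(t) != second.get(t))


# Hypothetical annotations of a four-token sentence.
annotator_1 = {1: (2, "nsubj"), 2: (0, "root"), 3: (4, "det"), 4: (2, "obj")}
annotator_2 = {1: (2, "nsubj"), 2: (0, "root"), 3: (4, "det"), 4: (2, "obl")}
to_adjudicate = find_disagreements(annotator_1, annotator_2)
print(to_adjudicate)
```

Here the two annotators agree on every head but give different labels to token 4, so only that token is flagged.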