The Israeli Association of Human Language Technologies

IAHLT Linguistic Datasets

IAHLT specializes in compiling linguistic resources for natural language processing. We collect large corpora in Hebrew and Arabic. The data is annotated by highly skilled linguists for morphology, syntax, and semantics. We adhere to high standards of quality control. A subset of each dataset is annotated by two different annotators and adjudicated by a third person. Our data is released with transparent scores of inter-annotator agreement along with the data on which these scores were calculated.

Download

Datasets

Release 2024-03-07

The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

The Hebrew treebank consists of nearly 70k dependency trees, including hundreds of complete documents, from Bagatz decisions, Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

Two independent annotations for nearly 15k of the 39k total sentences. Kappa inter-annotator agreement score per-feature.
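
The per-feature agreement score mentioned above can be illustrated with a minimal sketch. The release notes do not say which kappa variant is used, so the code below assumes Cohen's kappa for two annotators. The function names, the token-as-dictionary layout, and the feature names ("upos", "gender") are hypothetical and are not taken from the released files.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the released scores may be computed differently;
    this only illustrates the standard two-annotator formula.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: based on each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def kappa_per_feature(tokens_a, tokens_b, features):
    """Compute kappa separately for each feature.

    tokens_a / tokens_b: lists of dicts, one per token (hypothetical layout),
    aligned so that index i refers to the same token in both annotations.
    """
    return {
        feature: cohen_kappa(
            [tok[feature] for tok in tokens_a],
            [tok[feature] for tok in tokens_b],
        )
        for feature in features
    }


# Toy example with two hypothetical features.
annotator_1 = [
    {"upos": "NOUN", "gender": "Fem"},
    {"upos": "VERB", "gender": "Masc"},
    {"upos": "NOUN", "gender": "Masc"},
    {"upos": "NOUN", "gender": "Fem"},
]
annotator_2 = [
    {"upos": "NOUN", "gender": "Fem"},
    {"upos": "VERB", "gender": "Masc"},
    {"upos": "VERB", "gender": "Masc"},
    {"upos": "NOUN", "gender": "Fem"},
]
scores = kappa_per_feature(annotator_1, annotator_2, ["upos", "gender"])
```

Reporting a separate score for each feature shows which annotation layers are harder to agree on, which a single pooled score would hide.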

Our Palestinian Arabic text corpus consists of nearly 175k tokens of transcribed text from 78 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

Our Modern Standard Arabic lemmatisation and part-of-speech tagging corpus consists of 9697 annotations of 8481 sentences (with a total of 276k tokens annotated, 16896 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation and the All Rights entitlements organisation.

Our Palestinian Arabic lemmatisation and part-of-speech tagging corpus consists of 2807 annotated sentences (with a total of 39344 tokens annotated, 4304 unique lemmas) for lemma and part-of-speech. The texts were sampled from the YouTube video transcripts.

Our text summarization preview corpus consists of summaries of 511 articles, 63 of which have been independently summarized by two annotators. An inter-annotator agreement analysis is provided.

Our coreference resolution corpus consists of 554 articles in Hebrew from the Davar and Israel Hayom news organizations, Knesset protocols, and Hebrew Wikipedia, and 104 articles in Arabic from the Kul al-Arab news organization and the All Rights entitlement advocacy organization, annotated with morpheme-level spans for coreference chains.

Our Wikification corpus consists of 19 Arabic articles and 64 Hebrew articles from our named entities corpus, with 934 and 4059 resolutions (respectively) of entities to Wikidata concept entries.

Release 2023-09-06

The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

The Hebrew treebank consists of over 63k dependency trees, including hundreds of complete documents, from Bagatz decisions, Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

Source	Genres	no. Sentences	no. Documents
Bagatz	Supreme Court decisions	173	7
Davar	Hebrew news	13975	375
GeekTime	News, writing for the web, sometimes nonstandard Hebrew	4923	0
Israel Hayom	Hebrew news	1711	30
Knesset protocols	Protocols of the Knesset; each "document" is a separate protocol	4819	897
Kol Zchut	Information about entitlements rights of Israelis	13079	313
Wikipedia	Biographies, events, locations, legal, medical	24023	77

Two independent annotations for nearly 15k of the 39k total sentences. Kappa inter-annotator agreement score per-feature.

Our Palestinian Arabic text corpus consists of 148k tokens of transcribed text from 45 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

Year	Videos	Duration (min)	no. Tokens
2010-2014	4	18	2093
2015-2019	19	484	37469
2020-	19	942	62975

Our Modern Standard Arabic lemmatisation and part-of-speech tagging corpus consists of 5973 annotations of 4869 sentences (with a total of 171722 tokens annotated, 12190 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation and the All Rights entitlements organisation.

Our Palestinian Arabic lemmatisation and part-of-speech tagging corpus consists of 160 annotated sentences (with a total of 1830 tokens annotated, 485 unique lemmas) for lemma and part-of-speech. The texts were sampled from the YouTube video transcripts.

Our text summarization preview corpus consists of summaries of 89 Israel Hayom articles, 32 of which have been independently summarized by two annotators. An inter-annotator agreement analysis is provided.

Our coreference resolution preview corpus consists of 64 paragraphs from 4 articles in Hebrew and 25 paragraphs from 3 articles in Arabic, annotated with morpheme-level spans for coreference chains.

Our Wikification preview corpus consists of 10 paragraphs from our named entities corpus, with 215 resolutions of entities to Wikidata concept entries.

Release 2023-06-06

The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

The Hebrew treebank consists of over 48k dependency trees, including hundreds of complete documents, from Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

Source	Genres	no. Sentences	no. Documents
Davar	Hebrew news	11224	369
GeekTime	News, writing for the web, sometimes nonstandard Hebrew	4213	0
Israel Hayom	Hebrew news	386	7
Knesset protocols	Protocols of the Knesset; each "document" is a separate protocol	2883	100
Kol Zchut	Information about entitlements rights of Israelis	11214	313
Wikipedia	Biographies, events, locations, legal, medical	11309	78

Two independent annotations for over 6k of the 39k total sentences. Kappa inter-annotator agreement score per-feature.

Our named entities corpus includes over 123k paragraphs, over 11k documents in Arabic and Hebrew, annotated for 13 entity types.

Language	Source	no. Paragraphs	no. Articles
Hebrew	Davar, Israel Hayom, Knesset protocols, and Wikipedia	47k	2923
Arabic	Davar, Kul al-Arab, and Wikipedia	75k	8342

Our Palestinian Arabic corpus consists of 93k tokens of transcribed text from 34 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

Year	Videos	Duration (min)	no. Tokens
2010-2014	4	18	2131
2015-2019	19	484	37026
2020-	11	686	47129

Our Arabic lemmatisation and part-of-speech tagging corpus consists of 5973 annotations of 4869 sentences (with a total of 171722 tokens annotated, 12190 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation and YouTube video transcripts. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

Release 2023-05-29

The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

Source	Genres	no. Sentences	no. Documents
Davar	Hebrew news	11224	369
GeekTime	News, writing for the web, sometimes nonstandard Hebrew	4213	0
Israel Hayom	Hebrew news	386	7
Knesset protocols	Protocols of the Knesset; each "document" is a separate protocol	2883	100
Kol Zchut	Information about entitlements rights of Israelis	11214	313
Wikipedia	Biographies, events, locations, legal, medical	11309	78

Two independent annotations for over 6k of the 39k total sentences. Kappa inter-annotator agreement score per-feature.

Our named entities corpus includes over 123k paragraphs, over 11k documents in Arabic and Hebrew, annotated for 13 entity types.

Language	Source	no. Paragraphs	no. Articles
Hebrew	Davar, Israel Hayom, Knesset protocols, and Wikipedia	47k	2923
Arabic	Davar, Kul al-Arab, and Wikipedia	75k	8342

Our Palestinian Arabic corpus consists of 90k tokens of transcribed text from 32 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

Year	Videos	Duration (min)	no. Tokens
2010-2014	4	18	2131
2015-2019	19	484	37026
2020-	9	503	44133

Our Arabic lemmatisation and part-of-speech tagging corpus consists of 5747 annotations of 4646 sentences (with a total of 165723 tokens annotated, 11742 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

Release 2023-03-16

The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.

The Hebrew treebank consists of over 45k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.

Source	Genres	no. Sentences	no. Documents
Davar	Hebrew news	11224	369
GeekTime	News, writing for the web, sometimes nonstandard Hebrew	4213	0
Israel Hayom	Hebrew news	94	3
Kol Zchut	Information about entitlements rights of Israelis	11214	313
Wikipedia	Biographies, events, locations, legal, medical	11309	75

Two independent annotations for over 6k of the 36k total sentences. Kappa inter-annotator agreement score per-feature.

Our named entities corpus includes over 122k paragraphs, over 11k documents in Arabic and Hebrew, annotated for 13 entity types.

Language	Source	no. Paragraphs	no. Articles
Hebrew	Davar, Wikipedia and Israel Hayom	47k	2829
Arabic	Davar, Kul al-Arab, and Wikipedia	75k	8315

Year	Videos	Duration (min)	no. Tokens
2010-2014	4	18	2131
2015-2019	19	484	37026
2020-	9	503	44133

Our Arabic lemmatisation and part-of-speech tagging corpus consists of 4521 annotations of 3521 sentences (with a total of 130037 tokens annotated, 10135 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

Release 2022-12-01

The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and a preview sample of our Arabic corpus annotated for lemma and part-of-speech.

The Hebrew treebank consists of over 36k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.

Source	Genres	no. Sentences	no. Complete Documents
Davar	Hebrew news	10432	344
GeekTime	News, writing for the web, sometimes nonstandard Hebrew	2685	0
Kol Zchut	Information about entitlements rights of Israelis	10314	290
Wikipedia	Biographies, events, locations, legal, medical	9749	74

Two independent annotations for over 2.5k of the 33k total sentences. Kappa inter-annotator agreement score per-feature.

Our named entities corpus includes over 105k paragraphs, nearly 10k documents in Arabic and Hebrew, annotated for 12 entity types.

Language	Source	no. Paragraphs	no. Articles
Hebrew	Davar, Wikipedia and Israel Hayom	43k	2652
Arabic	Davar and Kul al-Arab	65k	6960

Our Palestinian Arabic corpus consists of 68k tokens of transcribed text from 27 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

Year	Videos	Duration (min)	no. Tokens
2011-2015	5	69	9054
2016-2020	17	238	26354
2021-	4	292	25841

Our Arabic lemmatisation and part-of-speech tagging corpus consists of 2198 annotations of 2100 sentences (with a total of 68260 tokens annotated, 7187 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

Release 2022-09-06

The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and a preview sample of our Arabic corpus annotated for lemma and part-of-speech.

The Hebrew treebank consists of over 32k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.

Source	Genres	no. Sentences	no. Complete Documents
Davar	Hebrew news	7563	253
GeekTime	News, writing for the web, sometimes nonstandard Hebrew	2685	0
Kol Zchut	Information about entitlements rights of Israelis	10306	289
Wikipedia	Biographies, events, locations, legal, medical	11278	74

Two independent annotations for over 1.7k of the 30k total sentences. Kappa inter-annotator agreement score per-feature.

Our named entities corpus includes over 85k paragraphs, nearly 7.5k documents in Arabic and Hebrew, annotated for 12 entity types.

Language	Source	no. Paragraphs	no. Articles
Hebrew	Davar, Wikipedia and Israel Hayom	32k	2046
Arabic	Davar and Kul al-Arab	53k	5467

Our Palestinian Arabic corpus consists of 35k tokens of transcribed text from 18 YouTube videos. Eight of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

Year	Videos	Duration (min)	no. Tokens
2011-2015	9	138	3037
2016-2020	19	171	15050
2021-	5	206	15182

Also included is a preview release of our Arabic lemmatisation and part-of-speech tagging corpus. The dataset consists of 388 annotations of 292 sentences (with a total of 11058 tokens annotated, 2009 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

Release 2022-06-09

The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, and a preview sample of our Palestinian Arabic text corpus.

The Hebrew treebank consists of over 24k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.

Source	Genres	no. Sentences	no. Complete Documents
Davar	Hebrew news	386	16
GeekTime	News, writing for the web, sometimes nonstandard Hebrew	2662	0
Kol Zchut	Information about entitlements rights of Israelis	9801	285
Wikipedia	Biographies, events, locations, legal, medical	11639	75

Two independent annotations for over 1k of the 23k total sentences. Kappa inter-annotator agreement score per-feature.

Our named entities corpus includes over 70k sentences, over 2k documents in Arabic and Hebrew, annotated for 12 entity types.

Language	Source	no. Sentences	no. Articles
Hebrew	Davar	43k	1200
Arabic	Davar and Kul al-Arab	30k	971

Our preview release of Palestinian Arabic consists of 12k tokens of transcribed text from 10 YouTube videos. Seven of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

Release 2022-04-12

The data consists of nearly 20k sentences, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, and the technology magazine Geektime.

Source	Genres	no. Sentences	no. Complete Documents
GeekTime	News, writing for the web, sometimes nonstandard Hebrew	2662	0
Kol Zchut	Information about entitlements rights of Israelis	5576	152
Wikipedia	Biographies, events, locations, legal, medical	11229	75

Two independent annotations for over 1k of the 18.5k total sentences. Kappa inter-annotator agreement score per-feature.

Also included are preview samples of named entity annotation, along with the annotation guidelines. Excerpts from Hebrew and Arabic sources were annotated with minimal entity spans for 13 entity types.

Language	Source	Sample Size
Hebrew	Wikipedia	39 paragraphs
Arabic	Kol Zchut Arabic	50 sentences

Hebrew-Arabic Parallel Corpus 2021-11-21

The dataset consists of over 4k parallel articles from the civil-legal domain. High-quality human translation from Hebrew to Arabic.

Alignment on the sentence level. The metadata includes two scores:

The confidence of the model in the alignment.
The "uniqueness" of the vocabulary of each article compared to the rest of the articles.
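
One way such an alignment confidence score might be used is to keep only the sentence pairs above a chosen threshold. The sketch below is illustrative only: the `AlignedPair` class, its field names, the assumption that confidence is stored per sentence pair, the 0 to 1 score range, and the 0.8 threshold are all hypothetical and do not describe the actual layout of the released metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedPair:
    """A Hebrew-Arabic sentence pair with its alignment confidence.

    Hypothetical structure: the real metadata may use different
    field names, value ranges, and granularity.
    """
    hebrew: str
    arabic: str
    confidence: float  # assumed to lie in [0, 1]


def filter_by_confidence(pairs: List[AlignedPair], threshold: float = 0.8) -> List[AlignedPair]:
    """Return only the pairs whose confidence meets the threshold."""
    return [pair for pair in pairs if pair.confidence >= threshold]


# Toy data with placeholder sentences.
pairs = [
    AlignedPair(hebrew="he-1", arabic="ar-1", confidence=0.95),
    AlignedPair(hebrew="he-2", arabic="ar-2", confidence=0.40),
    AlignedPair(hebrew="he-3", arabic="ar-3", confidence=0.80),
]
reliable = filter_by_confidence(pairs, threshold=0.8)
```

A stricter threshold trades corpus size for alignment quality, so a suitable value depends on the downstream task.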

Source	Genre	no. Articles	no. Sentences
All Rights	Government and legal data about rights, covering a wide range of topics: health, old age, religion, immigration, election	4,400 parallel articles	200,000 parallel sentences

Hebrew Treebank 2021-11-21

The data consists of dozens of articles from Wikipedia and from the technology magazine Geektime.

Source	Genres	no. Sentences
GeekTime	News, writing for the web, sometimes nonstandard Hebrew	2580
Wikipedia	Biographies, events, locations, legal, medical	6398

Two independent annotations for over 1.2k of the 9k total sentences. Adjudication by a third annotator. Kappa inter-annotator agreement score per-feature.
