The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of nearly 70k dependency trees, including hundreds of complete documents, from Bagatz decisions, Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.
Nearly 15k of the 39k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
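A per-feature kappa score of the kind mentioned above could be computed roughly as follows. This is only a minimal sketch, assuming Cohen's kappa for two annotators (the kappa variant is not stated here) and assuming each token's annotation is available as a dictionary of feature values. The function names `cohen_kappa` and `per_feature_kappa` and the feature names `upos` and `deprel` are hypothetical and are not part of the released data or tools.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences (assumed variant)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("no items to compare")
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def per_feature_kappa(tokens_a, tokens_b):
    """Kappa per feature, given two aligned lists of {feature: value} dicts."""
    features = tokens_a[0].keys()
    return {
        feature: cohen_kappa(
            [token[feature] for token in tokens_a],
            [token[feature] for token in tokens_b],
        )
        for feature in features
    }


# Hypothetical example: two annotators labelling the same two tokens.
annotator_1 = [{"upos": "NOUN", "deprel": "nsubj"}, {"upos": "VERB", "deprel": "root"}]
annotator_2 = [{"upos": "NOUN", "deprel": "obj"}, {"upos": "VERB", "deprel": "root"}]
scores = per_feature_kappa(annotator_1, annotator_2)
```

The sketch assumes both annotators produced the same tokenisation, so that tokens align one to one; doubly-annotated sentences with differing segmentation would need an alignment step first.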
Our Palestinian Arabic text corpus consists of nearly 175k tokens of transcribed text from 78 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Our Modern Standard Arabic lemmatisation and part-of-speech tagging corpus consists of 9697 annotations of 8481 sentences (with a total of 276k tokens annotated, 16896 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation and the All Rights entitlements organisation.
Our Palestinian Arabic lemmatisation and part-of-speech tagging corpus consists of 2807 annotated sentences (with a total of 39344 tokens annotated, 4304 unique lemmas) for lemma and part-of-speech. The texts were sampled from the YouTube video transcripts.
Our text summarization preview corpus consists of summaries of 511 articles, 63 of which have been independently summarized by two annotators. An inter-annotator agreement analysis is provided.
Our coreference resolution corpus consists of 554 articles in Hebrew from the Davar and Israel Hayom news organizations, Knesset protocols, and Hebrew Wikipedia, and 104 articles in Arabic from the Kul al-Arab news organization and the All Rights entitlement advocacy organization, annotated with morpheme-level spans for coreference chains.
Our Wikification corpus consists of 19 Arabic articles and 64 Hebrew articles from our named entities corpus, with 934 and 4059 resolutions (respectively) of entities to Wikidata concept entries.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 63k dependency trees, including hundreds of complete documents, from Bagatz decisions, Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.
Source | Genres | no. Sentences | no. Documents |
---|---|---|---|
Bagatz | Supreme Court decisions | 173 | 7 |
Davar | Hebrew news | 13975 | 375 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4923 | 0 |
Israel Hayom | Hebrew news | 1711 | 30 |
Knesset protocols | Protocols of the Knesset; each "document" is a separate protocol | 4819 | 897 |
Kol Zchut | Information about entitlements rights of Israelis | 13079 | 313 |
Wikipedia | Biographies, events, locations, legal, medical | 24023 | 77 |
Nearly 15k of the 39k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
Our Palestinian Arabic text corpus consists of 148k tokens of transcribed text from 45 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2010-2014 | 4 | 18 | 2093 |
2015-2019 | 19 | 484 | 37469 |
2020- | 19 | 942 | 62975 |
Our Modern Standard Arabic lemmatisation and part-of-speech tagging corpus consists of 5973 annotations of 4869 sentences (with a total of 171722 tokens annotated, 12190 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation and the All Rights entitlements organisation.
Our Palestinian Arabic lemmatisation and part-of-speech tagging corpus consists of 160 annotated sentences (with a total of 1830 tokens annotated, 485 unique lemmas) for lemma and part-of-speech. The texts were sampled from the YouTube video transcripts.
Our text summarization preview corpus consists of summaries of 89 Israel Hayom articles, 32 of which have been independently summarized by two annotators. An inter-annotator agreement analysis is provided.
Our coreference resolution preview corpus consists of 64 paragraphs from 4 articles in Hebrew and 25 paragraphs from 3 articles in Arabic, annotated with morpheme-level spans for coreference chains.
Our Wikification preview corpus consists of 10 paragraphs from our named entities corpus, with 215 resolutions of entities to Wikidata concept entries.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 48k dependency trees, including hundreds of complete documents, from Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.
Source | Genres | no. Sentences | no. Documents |
---|---|---|---|
Davar | Hebrew news | 11224 | 369 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4213 | 0 |
Israel Hayom | Hebrew news | 386 | 7 |
Knesset protocols | Protocols of the Knesset; each "document" is a separate protocol | 2883 | 100 |
Kol Zchut | Information about entitlements rights of Israelis | 11214 | 313 |
Wikipedia | Biographies, events, locations, legal, medical | 11309 | 78 |
Over 6k of the 39k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
Our named entities corpus includes over 123k paragraphs in over 11k Arabic and Hebrew documents, annotated for 13 entity types.
Language | Source | no. Paragraphs | no. Articles |
---|---|---|---|
Hebrew | Davar, Israel Hayom, Knesset protocols, and Wikipedia | 47k | 2923 |
Arabic | Davar, Kul al-Arab, and Wikipedia | 75k | 8342 |
Our Palestinian Arabic corpus consists of 93k tokens of transcribed text from 34 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2010-2014 | 4 | 18 | 2131 |
2015-2019 | 19 | 484 | 37026 |
2020- | 11 | 686 | 47129 |
Our Arabic lemmatisation and part-of-speech tagging corpus consists of 5973 annotations of 4869 sentences (with a total of 171722 tokens annotated, 12190 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation and YouTube video transcripts. For the multiply-annotated sentences, inter-annotator agreement statistics are included.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 48k dependency trees, including hundreds of complete documents, from Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.
Source | Genres | no. Sentences | no. Documents |
---|---|---|---|
Davar | Hebrew news | 11224 | 369 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4213 | 0 |
Israel Hayom | Hebrew news | 386 | 7 |
Knesset protocols | Protocols of the Knesset; each "document" is a separate protocol | 2883 | 100 |
Kol Zchut | Information about entitlements rights of Israelis | 11214 | 313 |
Wikipedia | Biographies, events, locations, legal, medical | 11309 | 78 |
Over 6k of the 39k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
Our named entities corpus includes over 123k paragraphs in over 11k Arabic and Hebrew documents, annotated for 13 entity types.
Language | Source | no. Paragraphs | no. Articles |
---|---|---|---|
Hebrew | Davar, Israel Hayom, Knesset protocols, and Wikipedia | 47k | 2923 |
Arabic | Davar, Kul al-Arab, and Wikipedia | 75k | 8342 |
Our Palestinian Arabic corpus consists of 90k tokens of transcribed text from 32 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2010-2014 | 4 | 18 | 2131 |
2015-2019 | 19 | 484 | 37026 |
2020- | 9 | 503 | 44133 |
Our Arabic lemmatisation and part-of-speech tagging corpus consists of 5747 annotations of 4646 sentences (with a total of 165723 tokens annotated, 11742 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 45k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.
Source | Genres | no. Sentences | no. Documents |
---|---|---|---|
Davar | Hebrew news | 11224 | 369 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4213 | 0 |
Israel Hayom | Hebrew news | 94 | 3 |
Kol Zchut | Information about entitlements rights of Israelis | 11214 | 313 |
Wikipedia | Biographies, events, locations, legal, medical | 11309 | 75 |
Over 6k of the 36k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
Our named entities corpus includes over 122k paragraphs in over 11k Arabic and Hebrew documents, annotated for 13 entity types.
Language | Source | no. Paragraphs | no. Articles |
---|---|---|---|
Hebrew | Davar, Wikipedia and Israel Hayom | 47k | 2829 |
Arabic | Davar, Kul al-Arab, and Wikipedia | 75k | 8315 |
Our Palestinian Arabic corpus consists of 90k tokens of transcribed text from 32 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2010-2014 | 4 | 18 | 2131 |
2015-2019 | 19 | 484 | 37026 |
2020- | 9 | 503 | 44133 |
Our Arabic lemmatisation and part-of-speech tagging corpus consists of 4521 annotations of 3521 sentences (with a total of 130037 tokens annotated, 10135 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and a preview sample of our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 36k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.
Source | Genres | no. Sentences | no. Complete Documents |
---|---|---|---|
Davar | Hebrew news | 10432 | 344 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2685 | 0 |
Kol Zchut | Information about entitlements rights of Israelis | 10314 | 290 |
Wikipedia | Biographies, events, locations, legal, medical | 9749 | 74 |
Over 2.5k of the 33k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
Our named entities corpus includes over 105k paragraphs in nearly 10k Arabic and Hebrew documents, annotated for 12 entity types.
Language | Source | no. Paragraphs | no. Articles |
---|---|---|---|
Hebrew | Davar, Wikipedia and Israel Hayom | 43k | 2652 |
Arabic | Davar and Kul al-Arab | 65k | 6960 |
Our Palestinian Arabic corpus consists of 68k tokens of transcribed text from 27 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2011-2015 | 5 | 69 | 9054 |
2016-2020 | 17 | 238 | 26354 |
2021- | 4 | 292 | 25841 |
Our Arabic lemmatisation and part-of-speech tagging corpus consists of 2198 annotations of 2100 sentences (with a total of 68260 tokens annotated, 7187 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and a preview sample of our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 32k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.
Source | Genres | no. Sentences | no. Complete Documents |
---|---|---|---|
Davar | Hebrew news | 7563 | 253 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2685 | 0 |
Kol Zchut | Information about entitlements rights of Israelis | 10306 | 289 |
Wikipedia | Biographies, events, locations, legal, medical | 11278 | 74 |
Over 1.7k of the 30k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
Our named entities corpus includes over 85k paragraphs in nearly 7.5k Arabic and Hebrew documents, annotated for 12 entity types.
Language | Source | no. Paragraphs | no. Articles |
---|---|---|---|
Hebrew | Davar, Wikipedia and Israel Hayom | 32k | 2046 |
Arabic | Davar and Kul al-Arab | 53k | 5467 |
Our Palestinian Arabic corpus consists of 35k tokens of transcribed text from 18 YouTube videos. Eight of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2011-2015 | 9 | 138 | 3037 |
2016-2020 | 19 | 171 | 15050 |
2021- | 5 | 206 | 15182 |
Also included is a preview release of our Arabic lemmatisation and part-of-speech tagging corpus. The dataset consists of 388 annotations of 292 sentences (with a total of 11058 tokens annotated, 2009 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, and a preview sample of our Palestinian Arabic text corpus.
The Hebrew treebank consists of over 24k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.
Source | Genres | no. Sentences | no. Complete Documents |
---|---|---|---|
Davar | Hebrew news | 386 | 16 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2662 | 0 |
Kol Zchut | Information about entitlements rights of Israelis | 9801 | 285 |
Wikipedia | Biographies, events, locations, legal, medical | 11639 | 75 |
Over 1k of the 23k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
Our named entities corpus includes over 70k sentences in over 2k Arabic and Hebrew documents, annotated for 12 entity types.
Language | Source | no. Sentences | no. Articles |
---|---|---|---|
Hebrew | Davar | 43k | 1200 |
Arabic | Davar and Kul al-Arab | 30k | 971 |
Our preview release of Palestinian Arabic consists of 12k tokens of transcribed text from 10 YouTube videos. Seven of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
The data consists of nearly 20k sentences, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, and the technology magazine Geektime.
Source | Genres | no. Sentences | no. Complete Documents |
---|---|---|---|
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2662 | 0 |
Kol Zchut | Information about entitlements rights of Israelis | 5576 | 152 |
Wikipedia | Biographies, events, locations, legal, medical | 11229 | 75 |
Over 1k of the 18.5k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
Also included are preview samples of named entity annotation, along with the annotation guidelines. Excerpts from Hebrew and Arabic sources were annotated with minimal entity spans for 13 entity types.
Language | Source | Sample Size |
---|---|---|
Hebrew | Wikipedia | 39 paragraphs |
Arabic | Kol Zchut Arabic | 50 sentences |
The dataset consists of over 4k parallel articles from the civil-legal domain, with high-quality human translation from Hebrew to Arabic.
The articles are aligned at the sentence level. The metadata includes two scores:
Source | Genre | no. Articles | no. Sentences |
---|---|---|---|
All Rights | Government and legal data about rights, covering a wide range of topics: health, old age, religion, immigration, election | 4,400 parallel articles | 200,000 parallel sentences |
The data consists of dozens of articles from Wikipedia and from the technology magazine Geektime.
Source | Genres | no. Sentences |
---|---|---|
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2580 |
Wikipedia | Biographies, events, locations, legal, medical | 6398 |
Over 1.2k of the 9k total sentences have two independent annotations, with adjudication by a third annotator. Kappa inter-annotator agreement scores are reported per feature.