The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 48k dependency trees, including hundreds of complete documents, from Knesset protocols, Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.
Source | Genres | no. Sentences | no. Documents |
---|---|---|---|
Davar | Hebrew news | 11224 | 369 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4213 | 0 |
Israel Hayom | Hebrew news | 386 | 7 |
Knesset protocols | Protocols of the Knesset; each "document" is a separate protocol | 2883 | 100 |
Kol Zchut | Information about the rights and entitlements of Israelis | 11214 | 313 |
Wikipedia | Biographies, events, locations, legal, medical | 11309 | 78 |
Over 6k of the 39k total sentences have two independent annotations. A kappa inter-annotator agreement score is reported per feature.
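As an illustration of what a per-feature kappa score measures, here is a minimal sketch. It assumes Cohen's kappa for two annotators; the release notes say only "Kappa", so the exact variant is an assumption. It also assumes each doubly annotated token is represented as a dictionary from feature name to value. The function names `cohen_kappa` and `kappa_per_feature` and the token representation are hypothetical and are not part of the released data or tooling.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def kappa_per_feature(tokens_a, tokens_b):
    """Kappa for each feature; tokens are dicts mapping feature name to value.

    A feature missing from a token is treated as the value None (an assumption).
    """
    features = sorted(set().union(*tokens_a, *tokens_b))
    return {
        f: cohen_kappa([t.get(f) for t in tokens_a], [t.get(f) for t in tokens_b])
        for f in features
    }


# Hypothetical toy example: four tokens annotated by two annotators.
annotator_1 = [{"upos": "NOUN"}, {"upos": "VERB"}, {"upos": "NOUN"}, {"upos": "ADJ"}]
annotator_2 = [{"upos": "NOUN"}, {"upos": "VERB"}, {"upos": "ADJ"}, {"upos": "ADJ"}]
scores = kappa_per_feature(annotator_1, annotator_2)
```

In this toy example the annotators agree on three of four tokens (0.75 observed agreement), but kappa is lower because some of that agreement would be expected by chance.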
Our named entity corpus includes over 123k paragraphs from over 11k documents in Arabic and Hebrew, annotated for 13 entity types.
Language | Source | no. Paragraphs | no. Articles |
---|---|---|---|
Hebrew | Davar, Israel Hayom, Knesset protocols, and Wikipedia | 47k | 2923 |
Arabic | Davar, Kul al-Arab, and Wikipedia | 75k | 8342 |
Our Palestinian Arabic corpus consists of 90k tokens of transcribed text from 32 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2010-2014 | 4 | 18 | 2131 |
2015-2019 | 19 | 484 | 37026 |
2020- | 9 | 503 | 44133 |
Our Arabic lemmatisation and part-of-speech tagging corpus consists of 5747 annotations of 4646 sentences (with a total of 165723 tokens annotated, 11742 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 45k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news sites Davar and Israel Hayom.
Source | Genres | no. Sentences | no. Documents |
---|---|---|---|
Davar | Hebrew news | 11224 | 369 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 4213 | 0 |
Israel Hayom | Hebrew news | 94 | 3 |
Kol Zchut | Information about the rights and entitlements of Israelis | 11214 | 313 |
Wikipedia | Biographies, events, locations, legal, medical | 11309 | 75 |
Over 6k of the 36k total sentences have two independent annotations. A kappa inter-annotator agreement score is reported per feature.
Our named entity corpus includes over 122k paragraphs from over 11k documents in Arabic and Hebrew, annotated for 13 entity types.
Language | Source | no. Paragraphs | no. Articles |
---|---|---|---|
Hebrew | Davar, Wikipedia and Israel Hayom | 47k | 2829 |
Arabic | Davar, Kul al-Arab, and Wikipedia | 75k | 8315 |
Our Palestinian Arabic corpus consists of 90k tokens of transcribed text from 32 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2010-2014 | 4 | 18 | 2131 |
2015-2019 | 19 | 484 | 37026 |
2020- | 9 | 503 | 44133 |
Our Arabic lemmatisation and part-of-speech tagging corpus consists of 4521 annotations of 3521 sentences (with a total of 130037 tokens annotated, 10135 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and a preview sample of our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 36k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.
Source | Genres | no. Sentences | no. Complete Documents |
---|---|---|---|
Davar | Hebrew news | 10432 | 344 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2685 | 0 |
Kol Zchut | Information about the rights and entitlements of Israelis | 10314 | 290 |
Wikipedia | Biographies, events, locations, legal, medical | 9749 | 74 |
Over 2.5k of the 33k total sentences have two independent annotations. A kappa inter-annotator agreement score is reported per feature.
Our named entity corpus includes over 105k paragraphs from nearly 10k documents in Arabic and Hebrew, annotated for 12 entity types.
Language | Source | no. Paragraphs | no. Articles |
---|---|---|---|
Hebrew | Davar, Wikipedia and Israel Hayom | 43k | 2652 |
Arabic | Davar and Kul al-Arab | 65k | 6960 |
Our Palestinian Arabic corpus consists of 68k tokens of transcribed text from 27 YouTube videos. Ten of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2011-2015 | 5 | 69 | 9054 |
2016-2020 | 17 | 238 | 26354 |
2021- | 4 | 292 | 25841 |
Our Arabic lemmatisation and part-of-speech tagging corpus consists of 2198 annotations of 2100 sentences (with a total of 68260 tokens annotated, 7187 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and a preview sample of our Arabic corpus annotated for lemma and part-of-speech.
The Hebrew treebank consists of over 32k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.
Source | Genres | no. Sentences | no. Complete Documents |
---|---|---|---|
Davar | Hebrew news | 7563 | 253 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2685 | 0 |
Kol Zchut | Information about the rights and entitlements of Israelis | 10306 | 289 |
Wikipedia | Biographies, events, locations, legal, medical | 11278 | 74 |
Over 1.7k of the 30k total sentences have two independent annotations. A kappa inter-annotator agreement score is reported per feature.
Our named entity corpus includes over 85k paragraphs from nearly 7.5k documents in Arabic and Hebrew, annotated for 12 entity types.
Language | Source | no. Paragraphs | no. Articles |
---|---|---|---|
Hebrew | Davar, Wikipedia and Israel Hayom | 32k | 2046 |
Arabic | Davar and Kul al-Arab | 53k | 5467 |
Our Palestinian Arabic corpus consists of 35k tokens of transcribed text from 18 YouTube videos. Eight of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
Year | Videos | Duration (min) | no. Tokens |
---|---|---|---|
2011-2015 | 9 | 138 | 3037 |
2016-2020 | 19 | 171 | 15050 |
2021- | 5 | 206 | 15182 |
Also included is a preview release of our Arabic lemmatisation and part-of-speech tagging corpus. The dataset consists of 388 annotations of 292 sentences (with a total of 11058 tokens annotated, 2009 unique lemmas) for lemma and part-of-speech. The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.
The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, and a preview sample of our Palestinian Arabic text corpus.
The Hebrew treebank consists of over 24k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.
Source | Genres | no. Sentences | no. Complete Documents |
---|---|---|---|
Davar | Hebrew news | 386 | 16 |
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2662 | 0 |
Kol Zchut | Information about the rights and entitlements of Israelis | 9801 | 285 |
Wikipedia | Biographies, events, locations, legal, medical | 11639 | 75 |
Over 1k of the 23k total sentences have two independent annotations. A kappa inter-annotator agreement score is reported per feature.
Our named entity corpus includes over 70k sentences from over 2k documents in Arabic and Hebrew, annotated for 12 entity types.
Language | Source | no. Sentences | no. Articles |
---|---|---|---|
Hebrew | Davar | 43k | 1200 |
Arabic | Davar and Kul al-Arab | 30k | 971 |
Our preview release of Palestinian Arabic consists of 12k tokens of transcribed text from 10 YouTube videos. Seven of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.
The data consists of nearly 20k sentences, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, and the technology magazine Geektime.
Source | Genres | no. Sentences | no. Complete Documents |
---|---|---|---|
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2662 | 0 |
Kol Zchut | Information about the rights and entitlements of Israelis | 5576 | 152 |
Wikipedia | Biographies, events, locations, legal, medical | 11229 | 75 |
Over 1k of the 18.5k total sentences have two independent annotations. A kappa inter-annotator agreement score is reported per feature.
Also included are preview samples of named entity annotation, along with the annotation guidelines. Excerpts from Hebrew and Arabic sources were annotated with minimal entity spans for 13 entity types.
Language | Source | Sample Size |
---|---|---|
Hebrew | Wikipedia | 39 paragraphs |
Arabic | Kol Zchut Arabic | 50 sentences |
The dataset consists of over 4k parallel articles from the civil-legal domain, translated from Hebrew to Arabic by human translators to a high standard.
The articles are aligned at the sentence level. The metadata includes two scores:
Source | Genre | no. Articles | no. Sentences |
---|---|---|---|
All Rights | Government and legal data about rights, covering a wide range of topics: health, old age, religion, immigration, election | 4,400 parallel articles | 200,000 parallel sentences |
The data consists of dozens of articles from Wikipedia and from the technology magazine Geektime.
Source | Genres | no. Sentences |
---|---|---|
GeekTime | News, writing for the web, sometimes nonstandard Hebrew | 2580 |
Wikipedia | Biographies, events, locations, legal, medical | 6398 |
Over 1.2k of the 9k total sentences have two independent annotations, adjudicated by a third annotator. A kappa inter-annotator agreement score is reported per feature.