The Israeli Association of Human Language Technologies

IAHLT Linguistic Datasets

IAHLT specializes in compiling linguistic resources for natural language processing. We collect large corpora in Hebrew and Arabic. The data is annotated by highly skilled linguists for morphology, syntax, and semantics. We adhere to high standards of quality control. A subset of each dataset is annotated by two different annotators and adjudicated by a third person. Our data is released with transparent scores of inter-annotator agreement along with the data on which these scores were calculated.
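
The double-annotation and adjudication workflow described above can be sketched as follows. This is a minimal illustration and not IAHLT's actual tooling: the function names, the flat list-of-labels layout, and the example part-of-speech labels are all assumptions made for the example.

```python
def split_for_adjudication(annotation_a, annotation_b):
    """Compare two annotators' labels for the same tokens.

    Returns (agreed, disputed): `agreed` maps token index to the shared label,
    `disputed` maps token index to the pair of conflicting labels that a third
    person would need to resolve.
    """
    if len(annotation_a) != len(annotation_b):
        raise ValueError("both annotations must cover the same tokens")
    agreed, disputed = {}, {}
    for index, (label_a, label_b) in enumerate(zip(annotation_a, annotation_b)):
        if label_a == label_b:
            agreed[index] = label_a
        else:
            disputed[index] = (label_a, label_b)
    return agreed, disputed


def apply_adjudication(agreed, decisions, length):
    """Merge agreed labels with the adjudicator's decisions into one sequence."""
    merged = {**agreed, **decisions}
    missing = [i for i in range(length) if i not in merged]
    if missing:
        raise ValueError(f"no label for token positions {missing}")
    return [merged[i] for i in range(length)]


# Hypothetical labels from two annotators for a three-token sentence.
annotator_a = ["NOUN", "VERB", "ADJ"]
annotator_b = ["NOUN", "AUX", "ADJ"]
agreed, disputed = split_for_adjudication(annotator_a, annotator_b)
final = apply_adjudication(agreed, {1: "AUX"}, len(annotator_a))
```

Keeping the disputed positions separate from the agreed ones means the third person only reviews genuine conflicts, while the original pair of annotations remains available for computing agreement scores.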

Datasets

  1. Release 2022-09-06

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, Palestinian Arabic text corpus, and a preview sample of our Arabic corpus annotated for lemma and part-of-speech.

    The Hebrew treebank consists of over 32k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.

    | Source | Genres | No. of sentences | No. of complete documents |
    |---|---|---|---|
    | Davar | Hebrew news | 7563 | 253 |
    | Geektime | News, writing for the web, sometimes nonstandard Hebrew | 2685 | 0 |
    | Kol Zchut | Information about the entitlements and rights of Israelis | 10306 | 289 |
    | Wikipedia | Biographies, events, locations, legal, medical | 11278 | 74 |

    Over 1.7k of the 30k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.
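
For readers who want to compute agreement figures of this kind on the doubly annotated sentences, the following is a minimal sketch of a per-feature kappa. It assumes that "Kappa" refers to Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The token layout (one dict per token) and the feature names `upos` and `deprel` are illustrative assumptions, not the release format.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of positions where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


def kappa_per_feature(tokens_a, tokens_b, features):
    """Compute kappa separately for every feature of token-aligned annotations."""
    return {
        feature: cohen_kappa(
            [token[feature] for token in tokens_a],
            [token[feature] for token in tokens_b],
        )
        for feature in features
    }


# Hypothetical annotations of the same four tokens by two annotators.
annotator_a = [
    {"upos": "NOUN", "deprel": "nsubj"},
    {"upos": "VERB", "deprel": "root"},
    {"upos": "NOUN", "deprel": "obj"},
    {"upos": "ADJ", "deprel": "amod"},
]
annotator_b = [
    {"upos": "NOUN", "deprel": "nsubj"},
    {"upos": "VERB", "deprel": "root"},
    {"upos": "ADJ", "deprel": "obj"},
    {"upos": "ADJ", "deprel": "amod"},
]
scores = kappa_per_feature(annotator_a, annotator_b, ["upos", "deprel"])
```

Computing the score per feature, rather than once over whole tokens, shows which annotation layers are harder to agree on.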

    Our named entities corpus includes over 85k paragraphs, nearly 7.5k documents in Arabic and Hebrew, annotated for 12 entity types.

    | Language | Source | No. of paragraphs | No. of articles |
    |---|---|---|---|
    | Hebrew | Davar, Wikipedia and Israel Hayom | 32k | 2046 |
    | Arabic | Davar and Kul al-Arab | 53k | 5467 |

    Our Palestinian Arabic corpus consists of 35k tokens of transcribed text from 18 YouTube videos. Eight of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

    | Year | Videos | Duration (min) | No. of tokens |
    |---|---|---|---|
    | 2011-2015 | 9 | 138 | 3037 |
    | 2016-2020 | 19 | 171 | 15050 |
    | 2021- | 5 | 206 | 15182 |

    Also included is a preview release of our Arabic lemmatisation and part-of-speech tagging corpus. The dataset consists of 388 lemma and part-of-speech annotations of 292 sentences (11058 annotated tokens in total, 2009 unique lemmas). The texts were sampled from the Kul al-Arab news organisation. For the multiply-annotated sentences, inter-annotator agreement statistics are included.

  2. Release 2022-06-09

    The data consists of our Hebrew treebank, Hebrew and Arabic named entity corpora, and a preview sample of our Palestinian Arabic text corpus.

    The Hebrew treebank consists of over 24k dependency trees, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, the technology magazine Geektime, and the news site Davar.

    | Source | Genres | No. of sentences | No. of complete documents |
    |---|---|---|---|
    | Davar | Hebrew news | 386 | 16 |
    | Geektime | News, writing for the web, sometimes nonstandard Hebrew | 2662 | 0 |
    | Kol Zchut | Information about the entitlements and rights of Israelis | 9801 | 285 |
    | Wikipedia | Biographies, events, locations, legal, medical | 11639 | 75 |

    Over 1k of the 23k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.

    Our named entities corpus includes over 70k sentences, over 2k documents in Arabic and Hebrew, annotated for 12 entity types.

    | Language | Source | No. of sentences | No. of articles |
    |---|---|---|---|
    | Hebrew | Davar | 43k | 1200 |
    | Arabic | Davar and Kul al-Arab | 30k | 971 |

    Our preview release of Palestinian Arabic consists of 12k tokens of transcribed text from 10 YouTube videos. Seven of these videos were independently annotated twice to study inter-annotator agreement; the annotations and our calculated agreements are included.

  3. Release 2022-04-12

    The data consists of nearly 20k sentences, including hundreds of complete documents, from Hebrew Wikipedia, the entitlement advocacy organisation Kol Zchut, and the technology magazine Geektime.

    | Source | Genres | No. of sentences | No. of complete documents |
    |---|---|---|---|
    | Geektime | News, writing for the web, sometimes nonstandard Hebrew | 2662 | 0 |
    | Kol Zchut | Information about the entitlements and rights of Israelis | 5576 | 152 |
    | Wikipedia | Biographies, events, locations, legal, medical | 11229 | 75 |

    Over 1k of the 18.5k total sentences have two independent annotations. Kappa inter-annotator agreement scores are reported per feature.

    Also included are preview samples of named entity annotation, along with the annotation guidelines. Excerpts from Hebrew and Arabic sources were annotated with minimal entity spans for 13 entity types.

    | Language | Source | Sample size |
    |---|---|---|
    | Hebrew | Wikipedia | 39 paragraphs |
    | Arabic | Kol Zchut Arabic | 50 sentences |

  4. Hebrew-Arabic Parallel Corpus 2021-11-21

    The dataset consists of over 4k parallel articles from the civil-legal domain, with high-quality human translations from Hebrew to Arabic.

    The articles are aligned at the sentence level. The metadata includes two scores:

    1. The confidence of the model in the alignment.
    2. The "uniqueness" of the vocabulary of each article compared to the rest of the articles.
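
The two scores above lend themselves to filtering the corpus before use. The following is a minimal sketch of such a filter. The record layout, the field names `alignment_confidence` and `article_uniqueness`, the assumption that higher values mean better alignment or more distinctive vocabulary, and the threshold values are all hypothetical; the actual metadata format and score ranges should be taken from the release itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One aligned sentence pair with its metadata (hypothetical layout)."""
    hebrew: str
    arabic: str
    alignment_confidence: float  # assumed: model confidence in the alignment
    article_uniqueness: float    # assumed: vocabulary uniqueness of the source article


def filter_pairs(pairs, min_confidence=0.9, min_uniqueness=0.0):
    """Keep pairs whose scores reach both thresholds (thresholds are illustrative)."""
    return [
        pair
        for pair in pairs
        if pair.alignment_confidence >= min_confidence
        and pair.article_uniqueness >= min_uniqueness
    ]


# Placeholder records; real sentences would come from the corpus files.
pairs = [
    AlignedPair("he-1", "ar-1", alignment_confidence=0.97, article_uniqueness=0.40),
    AlignedPair("he-2", "ar-2", alignment_confidence=0.55, article_uniqueness=0.80),
    AlignedPair("he-3", "ar-3", alignment_confidence=0.92, article_uniqueness=0.10),
]
confident = filter_pairs(pairs, min_confidence=0.9)
distinctive = filter_pairs(pairs, min_confidence=0.9, min_uniqueness=0.3)
```

A stricter confidence threshold trades corpus size for cleaner alignments, and a uniqueness threshold could help reduce near-duplicate content across articles.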

    | Source | Genre | No. of articles | No. of sentences |
    |---|---|---|---|
    | All Rights | Government and legal data about rights, covering a wide range of topics: health, old age, religion, immigration, election | 4,400 parallel articles | 200,000 parallel sentences |

  5. Hebrew Treebank 2021-11-21

    The data consists of dozens of articles from Wikipedia and from the technology magazine Geektime.

    | Source | Genres | No. of sentences |
    |---|---|---|
    | Geektime | News, writing for the web, sometimes nonstandard Hebrew | 2580 |
    | Wikipedia | Biographies, events, locations, legal, medical | 6398 |

    Over 1.2k of the 9k total sentences have two independent annotations, adjudicated by a third annotator. Kappa inter-annotator agreement scores are reported per feature.