This workshop was made possible by the European Science Foundation via the NetWordS network (supporting grant NetWordS 09-RNP-089, awarded to Jesús Fernández-Domínguez).
This three-day workshop is an introduction to data-mining of lexical databases.
Day 1 focusses on queries on simple, nicely structured files with R.
Day 2 focusses on Perl scripts for queries on more complex databases.
Day 3 focusses on data visualisation, using R (ggplot2) and Excel / R tools.
This is meant as a hands-on session, a workshop in a computer pool (Room 208, Olympe de Gouges). Attendance is limited to 18 participants, unless they bring their own laptops. Participants are expected to have installed the relevant R packages and software on their machines.
Room 208 (computer pools)
Bât. Olympe de Gouges
8 rue Albert Einstein
Data mining with R for simple files (CSV, such as Busca Palabras, data in Spanish)
0900 Nicolas Ballier : Introduction
0930 Vincent Renner, Jesus Fernandez & Nicolas Ballier : blends in English, French and Spanish : the blonset conjecture
0945 NB : RStudio and R packages
1000 Véronique Pouillon : input, output files in R
1045 coffee break
1105 Initial simple queries
1230 lunch break
1400 More complex queries
1600 coffee break
1630 Even more complex queries
1730 Some limits or caveats for R as a concordancer
1800 end
Room 208 (computer pools)
0945 NB : Perl for linguists
1000 Véronique Pouillon : Perl basic queries
1045 coffee break
1105 Initial simple queries
1230 lunch break
1400 More complex queries
1600 coffee break
1630 Even more complex queries
1730 Some limits or caveats for R vs. Perl
1800 end
Room 208 (computer pools) : 0930 - 1200, 1400 - 1700
Bât. Olympe de Gouges
8 rue Albert Einstein
This session will explain the logic of packages and demo some of the functions of the zipfR package.
This session will demo data visualisation with R and easy transitions from Excel to R using RExcel.
Linguistic datasets will be used for illustration purposes. The morning session will focus on words, wordles, bar charts, spider charts, and possible applications of the Levenshtein distance. The afternoon session will demo more complex data visualisation, boxplots and other plots for phonetics, using the {ggplot2} package.
Room 163
Bât. Olympe de Gouges
8 rue Albert Einstein
75013 Paris
Discussants : Alain Diana, Nicolas Ballier
0900 Nicolas Ballier : introduction
0915 - 0945 Adrien Méli : Vowels in the Longdale corpus: a longitudinal approach
0945 - 1015 Thomas Gaillat : Using R for this and that in the Longdale corpus
1015 - 1045 coffee break
1045 - 1130 Nicolas Ballier : Metrics in learner corpora (Longdale, ANGLISH, AixOx), a roadmap
1130 - 1200 Final discussion and next steps
Director : Prof. Natalie Kübler
Centre de Linguistique Inter-langues,
de Lexicologie, de Linguistique Anglaise
et de Corpus-Atelier de Recherche sur la Parole
EA 3967
8 place Paul Ricœur
75013 Paris
Case courrier 7002
5 rue Thomas Mann
75205 Paris cedex 13