Miguel 雷明
Community Tutor
Word-frequency-based content in Spanish?

Hi, everyone. I've been a Spanish tutor here for a couple of years, and one of my ongoing frustrations has been finding content (text, audio, video) suitable for learners.


Some years ago, the SUBTLEX project compiled subtitle-based word frequency lists for several languages, including Spanish. The theory is that word frequencies extracted from subtitles are a better reflection of spoken language than those extracted from newspapers and books. The EasyPronunciation website has a word counter based on one of these lists, and it lets you evaluate how difficult a given text will be to read. I got ahold of the list they used, and it seems to be based on movie subtitles from both Spain and the Americas, so it's a very good starting point. I decided to test some "for learners" materials using this word counter...


A typical "intermediate" reading on Veintemundos.com has 12-22% of its content OUTSIDE the top 5,000 Spanish words, and 36-38% OUTSIDE the top 1,000 Spanish words. Instituto Cervantes' "Lecturas paso a paso" fares a bit better: the lowest-level texts have 6-12% of words outside the top 5K, or around 25% of words outside the top 1K. As a reminder, researchers often insist that the number of unknown words must be kept at 2% or less for any given level. And as a personal note, both sites can be slow reads even for natives! Text and video for children are no better: quite often, materials for children are either designed to teach lots of new words at once, like fruit names or animal names, and so stray into low-frequency territory, or are just downright boring. The popular "Extra en español" show yielded similar results: the first episode has 11% of content outside the top 5K words, and 27% outside the top 1K. Paco Ardit's books on Amazon didn't fare much better.
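For anyone who wants to reproduce this kind of measurement, here is a minimal sketch in Python. It is not the EasyPronunciation counter: the function names, the tokenizer, and the tiny ranked list are all made up for illustration, and a real run would load the full subtitle-based list instead. It also counts surface word forms, so results may differ from a tool that groups inflected forms together.

```python
import re

def tokenize(text):
    # Lowercase the text and pull out runs of letters, including Spanish accented letters.
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

def outside_share(text, ranked_words, top_n):
    """Return the fraction of running words in `text` that are not among
    the first `top_n` entries of `ranked_words` (most frequent first)."""
    known = set(ranked_words[:top_n])
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in known)
    return unknown / len(tokens)

# Hypothetical toy frequency list; a real one would have thousands of entries.
ranked = ["de", "la", "que", "el", "en", "y", "a", "los", "se", "no"]
sample = "El gato no se come la manzana"

share = outside_share(sample, ranked, top_n=10)
print(f"{share:.0%} of running words fall outside the top 10")
```

Running the same function with top_n=1000 and top_n=5000 against a full list would give the two figures quoted for each source above.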


So I decided to follow XKCD's example (Google: Up Goer Five, Thing Explainer) and try to write content myself using only the top 1K words, top 2K words, and so on, to create graded content for learners that is actually based on objective measurements. But before embarking on this adventure, I want to make sure I'm not trying to reinvent the wheel: Is anyone out there already doing this?

Apr 27, 2018 2:48 PM
Comments · 5

Hi, K P.

Yes, Readlang already does that automatically for any web page, and FluentU already does it manually for lots of Spanish videos. I also use a similar system for reading Chinese. The thing is, even if checking the meaning of an unknown word is as easy as tapping on said word, you're already breaking the flow of reading/listening.


Well, as they say: If you want something and nobody has done it, go do it yourself. I've purchased the domain www.spanishinput.com and I'm working on it. You should see it up in the next 48 hours.

April 27, 2018
Miguel, what about using normal intermediate text, but supplying words "outside" the range with glosses (mouseover or just in parentheses)?

A simple script can mark them for you. But the glosses must be checked (and maybe even written) manually. Technically, you create a web-based HTML page with a button that lets the reader supply "all the words outside the 365 most common" with glosses.
Anyway, such partly glossed texts are nothing new; it is the way books for learners were printed in the 20th century.
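
As a rough illustration of the "simple script" idea, here is a Python sketch that adds a gloss in parentheses after every word outside a given set of common words. The function name, the word set, and the gloss dictionary are hypothetical; words with no gloss yet get a "?" so they are easy to spot and fill in by hand.

```python
import re

# Matches runs of letters, including Spanish accented letters.
WORD = re.compile(r"[A-Za-zÁÉÍÓÚÜÑáéíóúüñ]+")

def mark_rare_words(text, common_words, glosses):
    """Append a parenthesised gloss after each word not in `common_words`."""
    def annotate(match):
        word = match.group(0)
        key = word.lower()
        if key in common_words:
            return word
        # A "?" placeholder flags glosses that still have to be written manually.
        return f"{word} ({glosses.get(key, '?')})"
    return WORD.sub(annotate, text)

# Hypothetical inputs, for illustration only.
common = {"el", "la", "no", "se", "come"}
glosses = {"gato": "cat"}

marked = mark_rare_words("El gato no se come la manzana.", common, glosses)
print(marked)  # El gato (cat) no se come la manzana (?).
```

The same replacement function could emit an HTML tag with a title attribute instead of parentheses if mouseover glosses are preferred.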

P.S. In any event, creating content like what you've described makes sense simply because there is not enough of it! And no, I'm not aware of any.

April 27, 2018
Please, I would like someone to help me speak Spanish.
April 27, 2018
Sounds like a fantastic idea!
April 27, 2018

Hi Miguel,

I remember having seen a tool that stripped texts down to a certain frequency level. I looked around but couldn't locate it. Maybe it is somewhere here: https://www.lextutor.ca/


Saludos,


Antonio


April 27, 2018