> x=ngram(c("der","die","der die", "der+die","der die + die"), corpus = "de-2019", smoothing=0, count=TRUE)
> x
# Ngram data table
# Phrases: (der + die), (der die + die), der, der die, die
# Case-sensitive: TRUE
# Corpuses: de-2019
# Smoothing: 0
# Years: 1800-2019
Year Corpus Phrase Frequency Count
1 1800 de-2019 (der die + die) 2.205e-02 1560805
2 1800 de-2019 der die 7.745e-05 5582
3 1800 de-2019 der 2.403e-02 1747143
4 1800 de-2019 die 2.197e-02 1597683
5 1800 de-2019 (der + die) 4.600e-02 3285704
6 1801 de-2019 (der die + die) 2.254e-02 1618642
# ... with 1094 more rows
> (1597683+1747143) # line 3 + line 4
[1] 3344826
> 3285704-3344826 # line 5 minus the above sum
[1] -59122
> -59122/3285704 # relative error
[1] -0.01799
However, the frequencies do add up: 2.403e-02 + 2.197e-02 = 4.600e-02.
The count is well defined and should be additive:
count of "(house + cat)" = (count of "house") + (count of "cat")
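The additivity check can be sketched for every year at once. This is a minimal base R sketch; `check_additivity` is a hypothetical helper (not part of ngramr), and the small data frame only re-types the 1800 rows from the output above. It assumes each part appears once in the operator phrase.

```r
# Hypothetical helper: compare an operator phrase's Count with the
# sum of the Counts of its parts, separately for each year.
check_additivity <- function(df, parts, total) {
  # Sum the counts of the individual phrases per year
  part_sum <- aggregate(Count ~ Year, data = df[df$Phrase %in% parts, ], FUN = sum)
  names(part_sum)[2] <- "Expected"
  # Reported count of the operator phrase per year
  reported <- df[df$Phrase == total, c("Year", "Count")]
  out <- merge(reported, part_sum, by = "Year")
  # Relative error of the reported count against the expected sum
  out$RelError <- (out$Count - out$Expected) / out$Count
  out
}

# The 1800 rows from the output above, typed in by hand
x1800 <- data.frame(
  Year   = 1800,
  Phrase = c("der", "die", "(der + die)"),
  Count  = c(1747143, 1597683, 3285704)
)

res <- check_additivity(x1800, parts = c("der", "die"), total = "(der + die)")
res
```

Passing the full table `x` instead of `x1800` should give one row per year, provided it behaves like an ordinary data frame with `Year`, `Phrase` and `Count` columns.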
Yet "(der + die)" is not an n-gram at all. It is either an "n-gram set specifier" (select all n-grams that are either "der" or "die") or an expression that uses Google as a calculator. Because e.g. "(cat + cat)" results in twice the frequency, it has no intuitive set interpretation and is probably meant in the calculator sense.
As frequency is empirically additive in the Google interface (and according to their help page), frequency must be:
frequency of "(cat + house cat)" = (frequency of "cat") + (frequency of "house cat") =
(count of "cat" / count of 1-grams) + (count of "house cat" / count of 2-grams)
This would make the result a sum of frequencies with different denominators ("mixed frequencies").
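A small worked example of such a mixed frequency, using the 1800 counts for "der die" and "die". The two totals are assumptions: they are the values implied by Count/Frequency for that year (about 72.7 million 1-grams and 72.1 million 2-grams), not figures published by Google.

```r
# Counts for 1800 as reported by ngramr
count_die     <- 1597683   # 1-gram
count_der_die <- 5582      # 2-gram

# Assumed totals for 1800, implied by Count/Frequency (not official figures)
total_1grams <- 72711921
total_2grams <- 72075263

# Each term is divided by the total of its own n-gram order
freq_mixed <- count_der_die / total_2grams + count_die / total_1grams
freq_mixed  # roughly 0.02205, in line with the reported 2.205e-02

# A summed count divided by this frequency gives an "implied" total that
# equals neither total_1grams nor total_2grams, but lies between them
implied_total <- (count_der_die + count_die) / freq_mixed
implied_total
```

Under these assumptions, Count/Frequency for a mixed expression is a weighted blend of the 1-gram and 2-gram totals, so it should not be expected to match either of them.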
Knowing both counts and frequencies, it's possible to derive the total number of n-grams as Count/Frequency:
> mutate(x,Count/Frequency)[c(3,4,5,2,1),]
# Ngram data table
# Phrases: (der + die), (der die + die), der, der die, die
# Case-sensitive:
# Corpuses: de-2019
# Smoothing:
# Years: 1800-1800
Year Corpus Phrase Frequency Count Count/Frequency
3 1800 de-2019 der 2.403e-02 1747143 72711921 # 1-grams, probably correct
4 1800 de-2019 die 2.197e-02 1597683 72711922 # 1-grams, probably correct
5 1800 de-2019 (der + die) 4.600e-02 3285704 71426690 # 1-grams, should be: Count=3344826
2 1800 de-2019 der die 7.745e-05 5582 72075263 # 2-grams, probably correct
1 1800 de-2019 (der die + die) 2.205e-02 1560805 70784082 # 1:2-grams, should be: Count=1603265
The bug results in rather large estimation errors.
"(der + die)" with the corrected Count results in Count/Frequency=72711921, which would probably be correct.
"(der die + die)" with the corrected Count results in Count/Frequency=72709686, which is somewhat confusing.
I haven't yet managed to extract counts from the Google interface, although they are important for gauging sampling risks.
E.g. nouns seem to have become fewer since the year 2000.
Thanks for pointing this out: I will have a look. I think the problem is that, since the web page the data is scraped from doesn't provide counts, I have to use an approximate calculation based on a separate table of data which provides 1-gram counts by year. I don't think that calculation will work well with operators. I will try to investigate.
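To illustrate the kind of approximation described here, a minimal sketch follows. It is not ngramr's actual code: `estimate_count` is a hypothetical helper, and the 1800 total is an assumed value taken from the Count/Frequency column earlier in this thread.

```r
# Hypothetical helper, NOT ngramr's implementation: estimate a count from
# a frequency and a per-year total of 1-grams.
estimate_count <- function(frequency, total_1grams) {
  round(frequency * total_1grams)
}

# Assumed 1-gram total for 1800 (implied by Count/Frequency for "der")
total_1800 <- 72711921

est_der     <- estimate_count(0.02403,   total_1800)  # 1-gram "der"
est_der_die <- estimate_count(7.745e-05, total_1800)  # 2-gram "der die"

est_der      # close to the reported 1747143 (frequency is rounded)
est_der_die  # larger than the reported 5582
```

With a single 1-gram total, the estimate for a 2-gram is already off, which suggests why an expression that mixes n-gram orders would need a separate total per order.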
I've just realised there is another problem: I am also looking at doing an update for the latest corpus that has been released but it looks as though Google hasn't published n-gram counts.
Some counts are off by 2 to 3 % in version 1.9.3.