
text-as-data's Introduction

Text as Data

This repository contains materials for the Text as Data course taught at the Hertie School of Governance.

text-as-data's People

Contributors

mcallaghan

text-as-data's Issues

NMF doesn't have the right attributes

I have an issue accessing the attributes of NMF, in particular nmf.n_components_ and nmf.components_.
I found two resources that refer to two different releases of sklearn: version 1.1.3 (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) and version 0.18.1 (http://www.devdoc.net/python/sklearn-0.18/modules/generated/sklearn.decomposition.NMF.html). I checked the version of sklearn I am using, and the code confirms that I am on the stable version (1.1.3). I also printed the attributes I can access; the code and output are attached:

nmf = NMF(10)
print(dir(nmf))

output:

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_check_feature_names', '_check_n_features', '_check_params', '_check_w_h', '_fit_transform', '_get_param_names', '_get_tags', '_more_tags', '_n_features_out', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_scale_regularization', '_validate_data', 'alpha', 'alpha_H', 'alpha_W', 'beta_loss', 'fit', 'fit_transform', 'get_feature_names_out', 'get_params', 'init', 'inverse_transform', 'l1_ratio', 'max_iter', 'n_components', 'random_state', 'regularization', 'set_params', 'shuffle', 'solver', 'tol', 'transform', 'verbose']

Second part of the output, the error that is raised:

AttributeError                            Traceback (most recent call last)
Cell In [47], line 3
      1 nmf = NMF(10)
----> 2 nmf._n_features_out

File c:\Users\zazzo\.virtualenvs\Text_as_data-P4hExxh8\lib\site-packages\sklearn\decomposition\_nmf.py:1770, in NMF._n_features_out(self)
   1767 @property
   1768 def _n_features_out(self):
   1769     """Number of transformed output features."""
-> 1770     return self.components_.shape[0]

AttributeError: 'NMF' object has no attribute 'components_'

It is clear that the wanted attributes are not in the list. More interestingly, neither of the classes in the online documentation matches the one I am currently running.

Accessing those values is essential to completing assignment 2; any help is much appreciated!
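For context, sklearn sets every attribute with a trailing underscore (components_, n_components_) only during fitting; on an unfitted estimator they do not exist yet, which is consistent with the dir() output and the traceback above. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative matrix; NMF requires non-negative input
X = np.random.default_rng(0).random((20, 30))

nmf = NMF(n_components=10, init="nndsvd", max_iter=500)
W = nmf.fit_transform(X)          # fitting is what creates the *_ attributes

print(nmf.n_components_)          # 10
print(nmf.components_.shape)      # (10, 30)
```

Before the fit_transform() call, both attribute accesses raise the same AttributeError shown above.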

Merging the model with the original corpus

As many of us ran into this issue, Max asked me to share my solution for merging the results of the model analysis with the original corpus in order to visualize it:

length(topics(lda.model)) # the length of the LDA model I have created

# the two objects do not share the same length,
# so we create a new data frame from the original corpus

df <- data.frame(text = as.character(corp_manif))
df$id <- names(corp_manif)
df$partyyear <- manif$partyyear # add partyyear to identify the respective party manifesto

# now we create a data frame with the model output
doc_topics <- tidy(lda.model, matrix = "gamma") %>%
  left_join(df, by = c("document" = "id"))
# this joins the document metadata onto our new data frame

acceptable submission formats (Assignment 1)

I was wondering what kinds of submission formats are permitted when working with RMarkdown files. It is sometimes handier to work with HTML output rather than PDF, since code chunks and internally generated output (such as plots) are displayed in a format that is easier to read and edit for both readers and writers.

Thanks in advance @mcallaghan

Session 7 exercise workaround

If you're facing the problem of not being able to download the ZIP file from https://nlp.stanford.edu/projects/glove/, you can manually re-create the folder structure to get the function working:

glove <- embedding_glove6b(dimensions = 100, dir = "embeddings", manual_download = TRUE)

You have to create a ZIP file on your own (on macOS this works with right click --> Compress "…") and then rename the ZIP file "glove.6B.zip" -- the final folder structure should look like: .../embeddings/glove6b/glove.6B.zip
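For anyone who prefers to script it, the same folder layout can be created programmatically; here is a minimal Python sketch, where the placeholder file stands in for the archive you created by hand:

```python
import os
import shutil

# Recreate the layout that embedding_glove6b() expects with
# manual_download = TRUE: <dir>/glove6b/glove.6B.zip
os.makedirs("embeddings/glove6b", exist_ok=True)

# Placeholder standing in for the manually zipped archive in the
# current directory (replace with your real glove.6B.zip)
open("glove.6B.zip", "ab").close()

shutil.move("glove.6B.zip", "embeddings/glove6b/glove.6B.zip")
```

After this, calling the function with dir = "embeddings" should find the archive without attempting a download.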

scraping the links to protocols of individual sessions

It looks like the Bundestag website doesn't load the list of protocol links straight away, so those of you who are trying to find and select links from https://www.bundestag.de/services/opendata using the selector a.bt-link-dokument are coming away empty-handed.

In any case, we just need to select one XML link, and it is easiest to do this entirely by hand. Simply right-click on one of the XML links and choose "Copy Link Location"; we can then parse the protocol directly from this link.

CSV plus loading text in XML

Dear Max,
In the CSV you added, there appears to be a mistake: the "Infant Sorrow" text has other poems in it, and it shows that this particular poem has 111 lines. I am not sure whether this was done on purpose, so I thought I would mention it.

Dear All
I also ran into a small issue. I am working in R and trying to access the text spoken in the parliament. However, I cannot access the element J_1 nor J. I managed to get all the names and surnames, but accessing this one is not working for me. I can access the text using html_text(rede_data), but this returns all text elements in rede, such as names, comments, and party names. I could have cleaned it, but that takes additional time, so I am wondering whether someone has found a way around this?
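One way around cleaning the full html_text() output is to select only the paragraphs whose klasse attribute marks spoken text, skipping names and comments. A small self-contained Python sketch, with the element layout and klasse values assumed from the discussion above:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for one <rede> element from a plenary protocol
# (structure assumed: spoken text sits in <p> elements whose klasse
# attribute is "J" or "J_1", alongside speaker names and comments)
xml_snippet = """
<rede id="ID123">
  <p klasse="redner">Jane Doe (SPD)</p>
  <p klasse="J_1">Sehr geehrte Damen und Herren,</p>
  <kommentar>(Beifall)</kommentar>
  <p klasse="J">heute sprechen wir ueber Textdaten.</p>
</rede>
"""
rede = ET.fromstring(xml_snippet)

# Keep only paragraphs marked as spoken text
spoken = [p.text for p in rede.findall("p")
          if p.get("klasse") in {"J", "J_1"}]
print(spoken)
```

The same attribute filter can be expressed in rvest with an attribute selector such as `html_elements(rede_data, "p[klasse='J_1'], p[klasse='J']")`.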

Structuring a poem

Hello,

I am wondering whether the assignment requires us to get rid of the introductory text at the beginning and the legal text at the end. I am working on splitting the poems into different chunks, and this seems to be the first issue to deal with before I can do more. I am also not sure how to structure the chunks after that with regex.

Currently working with:

library(readr)   # read_file()
library(stringr) # str_split()

Gutenberg <- read_file("https://www.gutenberg.org/cache/epub/1934/pg1934.txt")

lines <- str_split(Gutenberg, "\r\n")
Books <- str_split(Gutenberg, "\r\n\r\n\r\n\r\n\r\n ")[[1]]
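On stripping the front and back matter: Project Gutenberg plain-text files delimit the body with "*** START OF ..." and "*** END OF ..." marker lines, so slicing between them drops both the introduction and the licence text. A self-contained Python sketch on a stand-in string (the real text would come from the URL above):

```python
import re

# Stand-in for a downloaded Project Gutenberg file
raw = """Some front matter
*** START OF THE PROJECT GUTENBERG EBOOK SONGS OF INNOCENCE ***
SONGS OF INNOCENCE

Piping down the valleys wild
*** END OF THE PROJECT GUTENBERG EBOOK SONGS OF INNOCENCE ***
Legal boilerplate"""

# Capture everything between the START and END marker lines
match = re.search(r"\*\*\* START OF .*? \*\*\*\n(.*?)\*\*\* END OF",
                  raw, flags=re.S)
body = match.group(1).strip()
print(body.splitlines()[0])  # SONGS OF INNOCENCE
```

The same pattern works with str_extract() in stringr; only the body then needs to be split into poems.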

Assignment 1 Task 2: Plotting a histogram

In task 2 we are asked to "create a histogram showing the number of lines per poem". However, to my understanding, a histogram shows the distribution of a numerical variable. Showing the number of lines per poem (poem being a categorical variable) would imply creating a barplot (lines per poem on the y axis, poem on the x axis).

Thanks in advance.

Exploding nested dictionary

I constructed a nested dictionary with all the information about the books, poems, stanzas, and lines. In total it has 4 levels (book, poem title, stanza number, lines). I tried to use stack() and explode() to transform it into a DataFrame, but this only works with 3 levels (i.e., for each book separately). With 4 levels, it removes the deepest level when I try to explode() it. Does anyone have a clue/idea about this?
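One alternative to stack()/explode() is to flatten the dictionary with a plain comprehension, one record per line, and build the DataFrame from the records afterwards (e.g. `pd.DataFrame(records, columns=["book", "poem", "stanza", "line"])`); pandas is not needed for the flattening itself. A sketch with illustrative data:

```python
# Illustrative 4-level structure: book -> poem -> stanza number -> lines
nested = {
    "Songs of Innocence": {
        "The Lamb": {
            1: ["Little Lamb, who made thee?", "Dost thou know who made thee?"],
            2: ["Little Lamb, I'll tell thee,"],
        }
    }
}

# One tuple per line, with all four levels preserved
records = [
    (book, poem, stanza, line)
    for book, poems in nested.items()
    for poem, stanzas in poems.items()
    for stanza, lines in stanzas.items()
    for line in lines
]
print(len(records))  # 3
```

Because the comprehension walks every level explicitly, nothing gets dropped no matter how deep the dictionary is.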

Lemmatization in German

Hello!

I am trying to lemmatize my German-language tokens - any hints on how I could do so? E.g., packages to use (ideally in combination with quanteda)?

I'd greatly appreciate any help!

Sonja
