DatasetCollection

Common datasets used in our research

Recommender systems

Social Recommendation

   
| Data Set | Users | Items | Ratings | Scale | Density | Context Users | Context Links | Link Type |
|---|---|---|---|---|---|---|---|---|
| Ciao [1] | 7,375 | 105,114 | 284,086 | [1, 5] | 0.0365% | 7,375 | 111,781 | Trust |
| Epinions [2] | 40,163 | 139,738 | 664,824 | [1, 5] | 0.0118% | 49,289 | 487,183 | Trust |
| Douban [3] | 2,848 | 39,586 | 894,887 | [1, 5] | 0.794% | 2,848 | 35,770 | Trust |
| LastFM [8] | 1,892 | 17,632 | 92,834 | implicit | 0.27% | 1,892 | 25,434 | Trust |
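
The Density column appears to be the number of ratings divided by the number of possible user-item pairs. The sketch below recomputes it under that assumption; the function name `rating_density` is hypothetical and not part of this repository, and the counts are copied from the social recommendation statistics above.

```python
def rating_density(num_users: int, num_items: int, num_ratings: int) -> float:
    """Return the fraction of user-item pairs that have a rating."""
    if num_users <= 0 or num_items <= 0:
        raise ValueError("num_users and num_items must be positive")
    return num_ratings / (num_users * num_items)


# (users, items, ratings) copied from the statistics above.
stats = {
    "Ciao": (7_375, 105_114, 284_086),
    "Epinions": (40_163, 139_738, 664_824),
    "Douban": (2_848, 39_586, 894_887),
}

for name, (users, items, ratings) in stats.items():
    # Printed as a percentage for comparison with the Density column.
    print(f"{name}: {rating_density(users, items, ratings):.4%}")
```

Small differences in the last digit against the listed values can come from rounding versus truncation.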

Music Recommendation

   
| Data Set | Users | Tracks | Artists | Albums | Records | Tag | User Profile | Artist Profile |
|---|---|---|---|---|---|---|---|---|
| NowPlaying [9] | 1,744 | 16,864 | 2,108 | N/A | 1,117,335 | N/A | N/A | N/A |
| Xiami [10] | 4,271 | 290,312 | 33,316 | 95,003 | 1,301,486 | Yes | N/A | N/A |
| Yahoo Music [source] | 1,800,000 | 136,000 | many | many | 717,000,000 | Yes | N/A | N/A |
| 30 Music [source][11] | 45,167 | 5,023,108 | 595,049 | 217,337 | many | Yes | Yes | N/A |

Paper Recommendation

 
| Data Set | Users | Papers | Feedback | Tag | Content |
|---|---|---|---|---|---|
| CiteULike [12] | 7,947 | 25,975 | 134,860 | 52,946 | full abstract |

Location Recommendation

 
| Data Set | Users | Locations | Feedback | Relation | Time |
|---|---|---|---|---|---|
| Gowalla | 18,737 | 32,510 | 1,278,274 | Yes | Yes |

Product Recommendation

 
| Data Set | Users | Items | Category | Behavior Type | Time |
|---|---|---|---|---|---|
| Taobao (Extraction code: xv8o) [24, 25] | 987,994 | 4,162,024 | 9,439 | 5 | Yes |

Spammer detection

Social Network

| Data Set | Non-spammer | Spammer | Introduction |
|---|---|---|---|
| Twitter [4] | 1,295 | 355 | The first column is the user class (i.e., 1 for non-spammers and 2 for spammers) and the subsequent columns numbered from 1 to 62 represent the user characteristics. |
| YouTube [5] | 641 | 31 (promoter), 157 (spammer) | The first column is the user class (i.e., 1 for promoters, 2 for spammers, and 3 for legitimates) and the subsequent columns numbered from 1 to 60 represent the user characteristics. |
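
Both files put the class label in the first column and the user features in the remaining columns. The sketch below shows one way to split such a row. It assumes whitespace- or comma-separated numeric fields, which this README does not confirm; the function name `parse_labeled_row` and the sample row are made up for illustration.

```python
from typing import List, Tuple


def parse_labeled_row(line: str) -> Tuple[int, List[float]]:
    """Split one row into (class label, feature values).

    Assumes whitespace- or comma-separated numeric fields; the real
    delimiter is not stated in this README.
    """
    fields = line.replace(",", " ").split()
    if len(fields) < 2:
        raise ValueError("expected a class label followed by features")
    label = int(float(fields[0]))
    features = [float(value) for value in fields[1:]]
    return label, features


# Made-up row: class 2 (a spammer in the Twitter file) and three features.
label, features = parse_labeled_row("2 0.5 13 0.0")
print(label, features)
```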

Shilling Detection

       
| Data Set | Non-spammer | Spammer | Introduction |
|---|---|---|---|
| Amazon [6] | 3,118 | 1,937 | Columns in profiles.txt follow this order: userid itemid rating. In labels.txt: 1: spammer, 0: non-spammer. |
| Yelp [7] | 52,815 | 80,466 | Columns in yelp.txt follow this order: user_id prod_id rating label date. Labels: -1: spammer, 1: non-spammer. |
I recommend filtering out users who have fewer than 5 ratings. More information can be found in Google Drive.
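
A minimal sketch of that filtering step for rows in the yelp.txt column order (user_id prod_id rating label date). It assumes whitespace-separated fields, which is not confirmed here; the names `Review`, `parse_reviews`, and `filter_active_users` are hypothetical, and the sample rows are invented.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Review(NamedTuple):
    user_id: str
    prod_id: str
    rating: float
    label: int  # -1: spammer, 1: non-spammer
    date: str


def parse_reviews(lines: Iterable[str]) -> List[Review]:
    """Parse rows in the order: user_id prod_id rating label date.

    Assumes whitespace-separated fields (not confirmed by the README).
    """
    reviews = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed rows
        user_id, prod_id, rating, label, date = parts
        reviews.append(Review(user_id, prod_id, float(rating), int(label), date))
    return reviews


def filter_active_users(reviews: List[Review], min_ratings: int = 5) -> List[Review]:
    """Keep only reviews written by users with at least min_ratings ratings."""
    counts = Counter(review.user_id for review in reviews)
    return [review for review in reviews if counts[review.user_id] >= min_ratings]


# Invented sample: user "u1" has 5 ratings, user "u2" has only 1.
sample = [f"u1 p{i} 4.0 1 2014-01-0{i}" for i in range(1, 6)]
sample.append("u2 p1 1.0 -1 2014-02-01")
kept = filter_active_users(parse_reviews(sample))
print(len(kept), {review.user_id for review in kept})
```

Counting ratings per user before filtering keeps the threshold in one place, so `min_ratings` can be changed without touching the parser.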

Cyberbullying Detection

| Data Set | Year | Annotation Method | # Data | # Cyberbullying | Cyberbullying Ratio |
|---|---|---|---|---|---|
| Formspring [13] | 2010 | Crowdsourcing | 3,915 | 369 | 9.43% |
| MySpace [14] | 2011 | Expert Labeling | 2,088 | 434 | 20.79% |
| Ask.fm [15] | 2014 | | | | |
| Instagram [16] | 2014 | Crowdsourcing | 1,954 | 567 | 29% |
| Vine [17] | 2015 | Crowdsourcing | 971 | 304 | 31.34% |
| BullyingV3.0 [18] | 2015 | Label Algorithm | 7,321 | 2,102 | 28.71% |
| WOW [19] | 2016 | Expert Labeling | 16,975 | 137 | 0.81% |
| LOL [19] | 2016 | Expert Labeling | 17,354 | 207 | 1.19% |
| Twitter [20] | 2017 | Crowdsourcing | 1,303 | 58 | 4.45% |
| Wikipedia [21] | 2017 | Crowdsourcing | 37,611 | 338 | 0.9% |
| Harassment-Corpus [22] | 2018 | Expert Labeling | 24,189 | 3,119 | 12.89% |
| Hate and Abusive Speech [23] | 2018 | Crowdsourcing | 99,799 | 46,009 | 46.1% |

References

[1]. Tang, J., Gao, H., Liu, H.: mTrust: discerning multi-faceted trust in a connected world. In: International Conference on Web Search and Web Data Mining, WSDM 2012, Seattle, WA, USA, February 2012. pp. 93–102

[2]. Massa, P., Avesani, P.: Trust-aware recommender systems. In: Proceedings of the 2007 ACM conference on Recommender systems. pp. 17–24. ACM (2007)

[3]. G. Zhao, X. Qian, and X. Xie, “User-service rating prediction by exploring social users’ rating behaviors,” IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 496–506, 2016.

[4]. Benevenuto, F., Magno, G., Rodrigues, T., & Almeida, V.: Detecting spammers on twitter. In: Collaboration, electronic messaging, anti-abuse and spam conference (CEAS). Vol. 6, No. 2010, p. 12. 2010.

[5]. Benevenuto, F., Rodrigues, T., Almeida, V., Almeida, J., & Gonçalves, M.: Detecting spammers and content promoters in online video social networks. In: Proceedings of the 32nd ACM SIGIR conference on Research and development in information retrieval. pp. 620-627. ACM (2009)

[6]. Xu, Chang, et al. "Uncovering collusive spammers in Chinese review websites." ACM International Conference on Information & Knowledge Management. ACM, 2013: 979-988.

[7]. Rayana, Shebuti, and L. Akoglu. "Collective Opinion Spam Detection: Bridging Review Networks and Metadata." ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015: 985-994.

[8]. Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM conference on Recommender systems (RecSys 2011). ACM, New York, NY, USA

[9]. Eva Zangerle, Martin Pichl, Wolfgang Gassler, and Günther Specht. 2014. #nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In Proceedings of the First International Workshop on Internet-Scale Multimedia Management (WISMM '14). ACM, New York, NY, USA, 21-26

[10]. Wang, Dongjing, et al. "Learning music embedding with metadata for context aware recommendation." Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 2016.

[11]. Turrin R, Quadrana M, Condorelli A, et al. 30Music Listening and Playlists Dataset[C]//RecSys Posters. 2015.

[12]. Hao Wang, Wu-Jun Li. Relational collaborative topic regression for recommender systems. IEEE Transactions on Knowledge and Data Engineering (TKDE), 27(5): 1343-1355, 2015.

[13]. Reynolds K, Kontostathis A, Edwards L. Using machine learning to detect cyberbullying. Machine learning and applications and workshops (ICMLA), 2011 10th International Conference on. IEEE, 2011, 2: 241-244.

[14]. Bayzick J, Kontostathis A, Edwards L. Detecting the presence of cyberbullying using computer software. In 3rd Annual ACM Web Science Conference (WebSci '11). 2011: 1-2.

[15]. Hosseinmardi H, Ghasemianlangroodi A, Han R, et al. Towards understanding cyberbullying behavior in a semi-anonymous social network. Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on. IEEE, 2014: 244-252.

[16]. Hosseinmardi H, Mattson S A, Rafiq R I, et al. Analyzing labeled cyberbullying incidents on the Instagram social network. International Conference on Social Informatics. Springer, Cham, 2015: 49-66.

[17]. Rafiq R I, Hosseinmardi H, Han R, et al. Careful what you share in six seconds: Detecting cyberbullying instances in Vine. Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. ACM, 2015: 617-622.

[18]. Sui J. Understanding and fighting bullying with machine learning[D]. The University of Wisconsin-Madison, 2015.

[19]. Bretschneider U, Peters R. Detecting Cyberbullying in Online Communities. ECIS. 2016: ResearchPaper61.

[20]. Chatzakou D, Kourtellis N, Blackburn J, et al. Mean birds: Detecting aggression and bullying on twitter. Proceedings of the 2017 ACM on web science conference. ACM, 2017: 13-22.

[21]. Wulczyn E, Thain N, Dixon L. Ex machina: Personal attacks seen at scale. Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017: 1391-1399.

[22]. Rezvan M, Shekarpour S, Balasuriya L, et al. A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research. Proceedings of the 10th ACM Conference on Web Science. ACM, 2018: 33-36.

[23]. Founta A-M, Djouvas C, Chatzakou D, et al. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. Proceedings of the 11th International Conference on Web and Social Media, ICWSM, 2018.

[24]. Han Z, Xiang L, Pengye Z, et al. Learning Tree-based Deep Model for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

[25]. Han Z, Daqing C, Ziru X, et al. Joint Optimization of Tree-based Index and Deep Model for Recommender Systems. arXiv:1902.07565.

datasetcollection's People

Contributors

0411tony, coder-yu, yuqi-song

datasetcollection's Issues

What is trust for Ciao dataset?

I downloaded the Ciao dataset, which is listed under Social Recommendation. It contains two files, ratings.txt and trusts.txt. I assume ratings.txt holds the user-item interactions, but what is trusts.txt for? Could you please explain?
