jetbrains-research / pubtrends
Scientific literature explorer. Runs a PubMed or Semantic Scholar search and lets the user explore the high-level structure of the resulting papers.
License: Apache License 2.0
Results from PostgreSQL (the year shown is that of the citing article):
pmid_citing | pmid_cited | year
-------------+------------+------
15316650 | 23453633 | 2004
The XML file for article 15316650 lists 23453633 and several other articles published since 2004 in its ReferenceList section, so this is not the parser's fault: https://www.ncbi.nlm.nih.gov/pubmed/?term=15316650&report=xml&format=text
Article 15316650 was revised in 2018, but I have no idea why the reference list would have changed, since the full text of the article contains only valid references.
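To see how widespread such records are, one option is to flag every citation whose citing article is older than the cited one. The following is a minimal sketch, assuming the citations are available as (citing, cited) PMID pairs and the publication years as a PMID-to-year mapping; the function name and both input shapes are hypothetical and are not taken from the pubtrends code.

```python
from typing import Dict, Iterable, List, Tuple


def find_anachronistic_citations(
    citations: Iterable[Tuple[int, int]],
    year_by_pmid: Dict[int, int],
) -> List[Tuple[int, int]]:
    """Return (citing, cited) pairs where the citing article is older than the cited one."""
    suspicious = []
    for citing, cited in citations:
        citing_year = year_by_pmid.get(citing)
        cited_year = year_by_pmid.get(cited)
        # Skip pairs with an unknown year rather than guessing.
        if citing_year is None or cited_year is None:
            continue
        if citing_year < cited_year:
            suspicious.append((citing, cited))
    return suspicious
```

Pairs flagged this way are candidates for manual inspection only; a revised record, as in the case above, may be the actual cause.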
oleg-laptop:pubtrends oleg$ java -jar crawler/build/libs/crawler-dev.jar
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
11:25:12.590 [main] INFO Created temporary directory: /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp9114403431170330541.tmp
11:25:27.656 [main] INFO Deleting directory: /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp9114403431170330541.tmp
Exception in thread "main" java.lang.NumberFormatException: For input string: "pubmed19n0001.xml.gz"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.jetbrains.bio.pubtrends.crawler.PubmedFTPHandler.getNewXMLsList(PubmedFTPHandler.kt:98)
at org.jetbrains.bio.pubtrends.crawler.PubmedFTPHandler.fetch(PubmedFTPHandler.kt:21)
at org.jetbrains.bio.pubtrends.crawler.PubmedCrawler.update(PubmedCrawler.kt:50)
at org.jetbrains.bio.pubtrends.MainKt.main(Main.kt:7)
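The trace shows `Integer.parseInt` receiving the whole file name, which suggests that the numeric part was not extracted first. The sketch below illustrates one way to do that extraction, written in Python rather than the crawler's Kotlin. The naming scheme (`pubmed`, two digits, `n`, four digits, `.xml.gz`) is an assumption inferred from the file names in the logs, and `file_number` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed naming scheme, inferred from the logged names: pubmed<YY>n<NNNN>.xml.gz
_BASELINE_NAME = re.compile(r"^pubmed(\d{2})n(\d{4})\.xml\.gz$")


def file_number(name: str) -> Optional[int]:
    """Return the numeric index of a PubMed archive name, or None if it does not match."""
    match = _BASELINE_NAME.match(name)
    # Only the captured digits are converted, never the whole file name.
    return int(match.group(2)) if match else None
```

Returning `None` for unexpected names lets the caller skip other files in the directory instead of failing on them.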
Make current goals and killer features more visible.
In this case we can deal with huge sets of PMIDs, as in the case of ['human', 'aging'].
See #24
Use both top n-grams + top TF-IDF for component description
Have a look at the Log4j library example: http://javastudy.ru/log4j/log4j-hello-world-example/
without co-citations information
Right now updating stops whenever even a minor Internet connection problem occurs. Additional attempts may help to avoid this problem.
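A retry wrapper with exponential backoff is one common way to add such attempts. This is a minimal sketch in Python, not the crawler's code: `with_retries`, its parameters and the choice to retry only on `OSError` are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action, retrying on OSError with exponentially growing pauses."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            # Socket errors and timeouts derive from OSError in the standard library.
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(delay_seconds * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so that tests can replace the real pause; in normal use it can be left at its default.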
This will help to design proper testing on both functionalities.
oleg-laptop:pubtrends oleg$ java -jar crawler/build/libs/crawler-dev.jar 2>&1 | tee log.txt
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
14:51:22.721 [main] INFO Created temporary directory: /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp
14:51:41.676 [main] INFO Found 976 new file(s)
14:51:41.677 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0001.xml.gz: Downloading...
14:52:14.088 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0001.xml.gz: Unpacking...
14:52:15.301 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0001.xml: Parsing...
14:52:30.269 [main] INFO Articles: 30000, keywords: 1325, citations: 0
14:52:30.270 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0001.xml: Storing...
14:52:34.869 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0001.xml: SUCCESS
14:52:34.869 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0002.xml.gz: Downloading...
14:53:01.561 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0002.xml.gz: Unpacking...
14:53:02.894 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0002.xml: Parsing...
14:53:12.243 [main] INFO Articles: 30000, keywords: 1572, citations: 0
14:53:12.243 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0002.xml: Storing...
14:53:15.991 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0002.xml: SUCCESS
14:53:15.992 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0003.xml.gz: Downloading...
14:53:41.260 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0003.xml.gz: Unpacking...
14:53:42.509 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0003.xml: Parsing...
14:53:52.145 [main] INFO Articles: 30000, keywords: 1831, citations: 0
14:53:52.145 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0003.xml: Storing...
14:53:54.862 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp3309464674076092605.tmp/pubmed19n0003.xml: SUCCESS
Example of <PubmedArticle> in the new file format (pubmed19n0011.xml):
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">304751</PMID>
<DateCompleted>
<Year>1978</Year>
<Month>04</Month>
<Day>17</Day>
</DateCompleted>
<DateRevised>
<Year>2018</Year>
<Month>11</Month>
<Day>13</Day>
</DateRevised>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0007-1447</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>1</Volume>
<Issue>6110</Issue>
<PubDate>
<Year>1978</Year>
<Month>Feb</Month>
<Day>18</Day>
</PubDate>
</JournalIssue>
<Title>British medical journal</Title>
<ISOAbbreviation>Br Med J</ISOAbbreviation>
</Journal>
<ArticleTitle>Unilateral short thumb associated with bleeding duodenal reduplication.</ArticleTitle>
<Pagination>
<MedlinePgn>412</MedlinePgn>
</Pagination>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Modlin</LastName>
<ForeName>I M</ForeName>
<Initials>IM</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Spencer</LastName>
<ForeName>J</ForeName>
<Initials>J</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D002363">Case Reports</PublicationType>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>England</Country>
<MedlineTA>Br Med J</MedlineTA>
<NlmUniqueID>0372673</NlmUniqueID>
<ISSNLinking>0007-1447</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>AIM</CitationSubset>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000015" MajorTopicYN="Y">Abnormalities, Multiple</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000328" MajorTopicYN="N">Adult</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D014670" MajorTopicYN="N">Ampulla of Vater</DescriptorName>
<QualifierName UI="Q000002" MajorTopicYN="N">abnormalities</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D004380" MajorTopicYN="N">Duodenal Obstruction</DescriptorName>
<QualifierName UI="Q000209" MajorTopicYN="N">etiology</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D004386" MajorTopicYN="N">Duodenum</DescriptorName>
<QualifierName UI="Q000002" MajorTopicYN="Y">abnormalities</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D005260" MajorTopicYN="N">Female</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006471" MajorTopicYN="N">Gastrointestinal Hemorrhage</DescriptorName>
<QualifierName UI="Q000209" MajorTopicYN="Y">etiology</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D013933" MajorTopicYN="N">Thumb</DescriptorName>
<QualifierName UI="Q000002" MajorTopicYN="Y">abnormalities</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="pubmed">
<Year>1978</Year>
<Month>2</Month>
<Day>18</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>1978</Year>
<Month>2</Month>
<Day>18</Day>
<Hour>0</Hour>
<Minute>1</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>1978</Year>
<Month>2</Month>
<Day>18</Day>
<Hour>0</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">304751</ArticleId>
<ArticleId IdType="pmc">PMC1602955</ArticleId>
</ArticleIdList>
<ReferenceList>
<Reference>
<Citation>Br J Surg. 1960 Mar;47:477-84</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">13797465</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Am J Dig Dis. 1974 Jul;19(7):673-7</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">4209729</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Br J Surg. 1972 Apr;59(4):324-6</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">4623190</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Am J Surg. 1971 Sep;122(3):418-20</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">5570620</ArticleId>
</ArticleIdList>
</Reference>
<Reference>
<Citation>Arch Surg. 1967 Feb;94(2):301-6</Citation>
<ArticleIdList>
<ArticleId IdType="pubmed">6016282</ArticleId>
</ArticleIdList>
</Reference>
</ReferenceList>
</PubmedData>
</PubmedArticle>
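In the sample above, the cited PMIDs sit under PubmedData/ReferenceList/Reference/ArticleIdList. A minimal Python sketch of reading them with the standard library follows; it assumes a single `<PubmedArticle>` element passed as a string, and `parse_article` is a hypothetical name, not the crawler's actual parser.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def parse_article(xml_text: str) -> Tuple[int, List[int]]:
    """Return (pmid, cited_pmids) for a single <PubmedArticle> element."""
    root = ET.fromstring(xml_text)
    pmid = int(root.findtext("MedlineCitation/PMID"))
    # Restrict the path to ReferenceList so the article's own ArticleIdList is not picked up.
    cited = [
        int(node.text)
        for node in root.findall(
            "PubmedData/ReferenceList/Reference/ArticleIdList/ArticleId[@IdType='pubmed']"
        )
    ]
    return pmid, cited
```

References that carry no `pubmed` identifier are skipped by the attribute filter, so the returned list may be shorter than the reference list itself.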
Related to #12
This can help to add some tests on the functionality.
Can we consider caching the co-citations table, for example by adding a PostgreSQL view?
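For reference, the data such a cache would hold can be sketched in a few lines of Python: two papers are co-cited once for every article that cites both. The function name and the (citing, cited) input shape are assumptions for illustration; this is not the project's query.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def cocitation_counts(citations: Iterable[Tuple[int, int]]) -> Dict[Tuple[int, int], int]:
    """Count, for each pair of cited PMIDs, how many articles cite both."""
    cited_by_article: Dict[int, Set[int]] = {}
    for citing, cited in citations:
        cited_by_article.setdefault(citing, set()).add(cited)

    counts: Counter = Counter()
    for cited_set in cited_by_article.values():
        # Sorting gives each pair a canonical (smaller, larger) order.
        for pair in combinations(sorted(cited_set), 2):
            counts[pair] += 1
    return dict(counts)
```

The number of pairs grows quadratically with the length of each reference list, which is one reason precomputing the table may be attractive.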
Limit by theme: "Ageing" or "aging"
Limit papers by timeline, starting from 2000.
At the moment, after following all the instructions in README.md, I get the following error:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Dec 14, 2018 10:59:21 AM org.postgresql.core.v3.ConnectionFactoryImpl log
WARNING: SQLException occurred while connecting to localhost:5432
org.postgresql.util.PSQLException: FATAL: role "biolabs" is not permitted to log in
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2433)
at org.postgresql.core.v3.QueryExecutorImpl.readStartupMessages(QueryExecutorImpl.java:2566)
at org.postgresql.core.v3.QueryExecutorImpl.<init>(QueryExecutorImpl.java:131)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:210)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
at org.postgresql.Driver.makeConnection(Driver.java:452)
at org.postgresql.Driver.connect(Driver.java:254)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at org.jetbrains.exposed.sql.Database$Companion$connect$7.invoke(Database.kt:112)
at org.jetbrains.exposed.sql.Database$Companion$connect$7.invoke(Database.kt:71)
at org.jetbrains.exposed.sql.Database$Companion$doConnect$3.invoke(Database.kt:91)
at org.jetbrains.exposed.sql.Database$Companion$doConnect$3.invoke(Database.kt:71)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManager$ThreadLocalTransaction$connectionLazy$1.invoke(ThreadLocalTransactionManager.kt:25)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManager$ThreadLocalTransaction$connectionLazy$1.invoke(ThreadLocalTransactionManager.kt:22)
at kotlin.UnsafeLazyImpl.getValue(Lazy.kt:81)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManager$ThreadLocalTransaction.getConnection(ThreadLocalTransactionManager.kt:31)
at org.jetbrains.exposed.sql.Transaction.getConnection(Transaction.kt)
at org.jetbrains.exposed.sql.Database.getMetadata$exposed(Database.kt:17)
at org.jetbrains.exposed.sql.Database$url$2.invoke(Database.kt:26)
at org.jetbrains.exposed.sql.Database$url$2.invoke(Database.kt:15)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at org.jetbrains.exposed.sql.Database.getUrl(Database.kt)
at org.jetbrains.exposed.sql.Database$dialect$2.invoke(Database.kt:29)
at org.jetbrains.exposed.sql.Database$dialect$2.invoke(Database.kt:15)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at org.jetbrains.exposed.sql.Database.getDialect$exposed(Database.kt)
at org.jetbrains.exposed.sql.vendors.DefaultKt.getCurrentDialect(Default.kt:341)
at org.jetbrains.exposed.sql.vendors.DefaultKt.getCurrentDialectIfAvailable(Default.kt:345)
at org.jetbrains.exposed.sql.Column.getOnUpdate$exposed(Column.kt:14)
at org.jetbrains.exposed.sql.Table.nullable(Table.kt:399)
at org.jetbrains.bio.pubtrends.crawler.Publications.<clinit>(DatabaseModel.kt:7)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler$1.invoke(DatabaseHandler.kt:31)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler$1.invoke(DatabaseHandler.kt:9)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.inTopLevelTransaction(ThreadLocalTransactionManager.kt:103)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:74)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction$default(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler.<init>(DatabaseHandler.kt:24)
at org.jetbrains.bio.pubtrends.crawler.PubmedCrawler.<init>(PubmedCrawler.kt:14)
at org.jetbrains.bio.pubtrends.crawler.ParserTest.<clinit>(ParserTest.kt:9)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:250)
at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:260)
at org.junit.runners.BlockJUnit4ClassRunner$2.runReflectiveCall(BlockJUnit4ClassRunner.java:309)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.BlockJUnit4ClassRunner.methodBlock(BlockJUnit4ClassRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:349)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:314)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:312)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:292)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:396)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:314)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:312)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:292)
at org.junit.runners.ParentRunner.run(ParentRunner.java:396)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Dec 14, 2018 10:59:21 AM org.postgresql.Driver connect
SEVERE: Connection error:
org.postgresql.util.PSQLException: FATAL: role "biolabs" is not permitted to log in
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2433)
at org.postgresql.core.v3.QueryExecutorImpl.readStartupMessages(QueryExecutorImpl.java:2566)
at org.postgresql.core.v3.QueryExecutorImpl.<init>(QueryExecutorImpl.java:131)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:210)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
at org.postgresql.Driver.makeConnection(Driver.java:452)
at org.postgresql.Driver.connect(Driver.java:254)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at org.jetbrains.exposed.sql.Database$Companion$connect$7.invoke(Database.kt:112)
at org.jetbrains.exposed.sql.Database$Companion$connect$7.invoke(Database.kt:71)
at org.jetbrains.exposed.sql.Database$Companion$doConnect$3.invoke(Database.kt:91)
at org.jetbrains.exposed.sql.Database$Companion$doConnect$3.invoke(Database.kt:71)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManager$ThreadLocalTransaction$connectionLazy$1.invoke(ThreadLocalTransactionManager.kt:25)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManager$ThreadLocalTransaction$connectionLazy$1.invoke(ThreadLocalTransactionManager.kt:22)
at kotlin.UnsafeLazyImpl.getValue(Lazy.kt:81)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManager$ThreadLocalTransaction.getConnection(ThreadLocalTransactionManager.kt:31)
at org.jetbrains.exposed.sql.Transaction.getConnection(Transaction.kt)
at org.jetbrains.exposed.sql.Database.getMetadata$exposed(Database.kt:17)
at org.jetbrains.exposed.sql.Database$url$2.invoke(Database.kt:26)
at org.jetbrains.exposed.sql.Database$url$2.invoke(Database.kt:15)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at org.jetbrains.exposed.sql.Database.getUrl(Database.kt)
at org.jetbrains.exposed.sql.Database$dialect$2.invoke(Database.kt:29)
at org.jetbrains.exposed.sql.Database$dialect$2.invoke(Database.kt:15)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at org.jetbrains.exposed.sql.Database.getDialect$exposed(Database.kt)
at org.jetbrains.exposed.sql.vendors.DefaultKt.getCurrentDialect(Default.kt:341)
at org.jetbrains.exposed.sql.vendors.DefaultKt.getCurrentDialectIfAvailable(Default.kt:345)
at org.jetbrains.exposed.sql.Column.getOnUpdate$exposed(Column.kt:14)
at org.jetbrains.exposed.sql.Table.nullable(Table.kt:399)
at org.jetbrains.bio.pubtrends.crawler.Publications.<clinit>(DatabaseModel.kt:7)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler$1.invoke(DatabaseHandler.kt:31)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler$1.invoke(DatabaseHandler.kt:9)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.inTopLevelTransaction(ThreadLocalTransactionManager.kt:103)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:74)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction$default(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler.<init>(DatabaseHandler.kt:24)
at org.jetbrains.bio.pubtrends.crawler.PubmedCrawler.<init>(PubmedCrawler.kt:14)
at org.jetbrains.bio.pubtrends.crawler.ParserTest.<clinit>(ParserTest.kt:9)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:250)
at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:260)
at org.junit.runners.BlockJUnit4ClassRunner$2.runReflectiveCall(BlockJUnit4ClassRunner.java:309)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.BlockJUnit4ClassRunner.methodBlock(BlockJUnit4ClassRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:349)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:314)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:312)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:292)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:396)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:314)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:312)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:292)
at org.junit.runners.ParentRunner.run(ParentRunner.java:396)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
java.lang.ExceptionInInitializerError
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler$1.invoke(DatabaseHandler.kt:31)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler$1.invoke(DatabaseHandler.kt:9)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.inTopLevelTransaction(ThreadLocalTransactionManager.kt:103)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:74)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction$default(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler.<init>(DatabaseHandler.kt:24)
at org.jetbrains.bio.pubtrends.crawler.PubmedCrawler.<init>(PubmedCrawler.kt:14)
at org.jetbrains.bio.pubtrends.crawler.ParserTest.<clinit>(ParserTest.kt:9)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:250)
at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:260)
at org.junit.runners.BlockJUnit4ClassRunner$2.runReflectiveCall(BlockJUnit4ClassRunner.java:309)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.BlockJUnit4ClassRunner.methodBlock(BlockJUnit4ClassRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:349)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:314)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:312)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:292)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:396)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:314)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:312)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:292)
at org.junit.runners.ParentRunner.run(ParentRunner.java:396)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.postgresql.util.PSQLException: FATAL: role "biolabs" is not permitted to log in
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2433)
at org.postgresql.core.v3.QueryExecutorImpl.readStartupMessages(QueryExecutorImpl.java:2566)
at org.postgresql.core.v3.QueryExecutorImpl.<init>(QueryExecutorImpl.java:131)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:210)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
at org.postgresql.Driver.makeConnection(Driver.java:452)
at org.postgresql.Driver.connect(Driver.java:254)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at org.jetbrains.exposed.sql.Database$Companion$connect$7.invoke(Database.kt:112)
at org.jetbrains.exposed.sql.Database$Companion$connect$7.invoke(Database.kt:71)
at org.jetbrains.exposed.sql.Database$Companion$doConnect$3.invoke(Database.kt:91)
at org.jetbrains.exposed.sql.Database$Companion$doConnect$3.invoke(Database.kt:71)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManager$ThreadLocalTransaction$connectionLazy$1.invoke(ThreadLocalTransactionManager.kt:25)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManager$ThreadLocalTransaction$connectionLazy$1.invoke(ThreadLocalTransactionManager.kt:22)
at kotlin.UnsafeLazyImpl.getValue(Lazy.kt:81)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManager$ThreadLocalTransaction.getConnection(ThreadLocalTransactionManager.kt:31)
at org.jetbrains.exposed.sql.Transaction.getConnection(Transaction.kt)
at org.jetbrains.exposed.sql.Database.getMetadata$exposed(Database.kt:17)
at org.jetbrains.exposed.sql.Database$url$2.invoke(Database.kt:26)
at org.jetbrains.exposed.sql.Database$url$2.invoke(Database.kt:15)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at org.jetbrains.exposed.sql.Database.getUrl(Database.kt)
at org.jetbrains.exposed.sql.Database$dialect$2.invoke(Database.kt:29)
at org.jetbrains.exposed.sql.Database$dialect$2.invoke(Database.kt:15)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
at org.jetbrains.exposed.sql.Database.getDialect$exposed(Database.kt)
at org.jetbrains.exposed.sql.vendors.DefaultKt.getCurrentDialect(Default.kt:341)
at org.jetbrains.exposed.sql.vendors.DefaultKt.getCurrentDialectIfAvailable(Default.kt:345)
at org.jetbrains.exposed.sql.Column.getOnUpdate$exposed(Column.kt:14)
at org.jetbrains.exposed.sql.Table.nullable(Table.kt:399)
at org.jetbrains.bio.pubtrends.crawler.Publications.<clinit>(DatabaseModel.kt:7)
... 42 more
Try to use the top n-grams from the top 10% of co-citations within a component
At the moment only the paper title is used to describe a subtopic.
The pipeline is the following: paper title -> n-grams -> TF-IDF -> max
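The pipeline can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: each component is treated as one document made of all its title n-grams, TF-IDF is the plain tf * log(N / df) variant, and the names `ngrams` and `describe_components` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def ngrams(title: str, max_n: int = 2) -> List[str]:
    """Split a title into lowercase word n-grams of length 1..max_n."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def describe_components(
    titles_by_component: Dict[str, List[str]], max_n: int = 2
) -> Dict[str, str]:
    """Pick the n-gram with the highest TF-IDF score for each component."""
    counts = {
        comp: Counter(g for title in titles for g in ngrams(title, max_n))
        for comp, titles in titles_by_component.items()
    }
    n_docs = len(counts)
    # Document frequency: in how many components each n-gram occurs.
    doc_freq = Counter(g for counter in counts.values() for g in counter)

    best: Dict[str, str] = {}
    for comp, counter in counts.items():
        total = sum(counter.values())
        scores = {
            g: (freq / total) * math.log(n_docs / doc_freq[g])
            for g, freq in counter.items()
        }
        # Iterating in sorted order makes ties resolve alphabetically.
        best[comp] = max(sorted(scores), key=scores.get) if scores else ""
    return best
```

With this weighting, an n-gram that occurs in every component scores zero, so generic words drop out without a stop-word list.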
Query:
EXPLAIN ANALYSE SELECT C1.pmid_citing, C1.pmid_cited, C2.pmid_cited, P.year
FROM Citations C1
JOIN (VALUES (10660603), (15319361), (15356349), (16424025), (16627569), (16714284), (16847060), (16955484), (17081107), (17275731), (17534140), (18414039), (18443001), (18514625), (18555664), (18769112), (18787087), (18926585), (18971624), (19200882), (19245654), (19279323), (19380253), (19448702), (19478560), (19539012), (19602051), (19740975), (19805415), (19923900), (20047144), (20096035), (20139716), (20154608), (20157541), (20169165), (20363920), (20388102), (20437201), (20445122), (20519118), (20606252), (20676050), (20729871), (20739737), (20818934), (20886754), (20965424), (21115526), (21150328), (21157483), (21159787), (21179166), (21191146), (21212465), (21415462), (21428920), (21483039), (21483870), (21501117), (21520297), (21541762), (21555915), (21562229), (21572994), (21798089), (21840335), (21858089), (21917559), (21931802), (22115588), (22125056), (22246147), (22327552), (22354768), (22363791), (22388478), (22394614), (22408430), (22410287), (22468953), (22500797), (22546364), (22580468), (22672902), (22683661), (22817723), (22958933), (22960547), (22987149), (23006971), (23061800), (23239011), (23246968), (23255104), (23276696), (23325216), (23341224), (23363784), (23374718), (23399685), (23454756), (23454868), (23470275), (23517348), (23525940), (23555298), (23606170), (23625314), (23648089), (23686362), (23688930), (23702245), (23702336), (23734707), (23817674), (23850396), (23851366), (23884442), (23936371), (23982787), (24024901), (24178346), (24236459), (24296616), (24308993), (24324270), (24336084), (24350927), (24489988), (24496328), (24508508), (24518659), (24562770), (24589862), (24607448), (24677687), (24744983), (24774073), (24799956), (24821673), (24862022), (24866016), (24899720), (24915467), (24918639), (24981831), (25038772), (25040542), (25062253), (25088526), (25110610), (25239873), (25249372), (25258312), (25341517), (25348018), (25388238), (25449851), (25470422), (25476900), (25483712), (25491300), (25540326), (25553480), 
(25568097), (25587030), (25596147), (25655936), (25661995), (25686248), (25758051), (25776557), (25796566), (25807975), (25827254), (25902704), (25907074), (25926513), (26017155), (26051878), (26053964), (26059377), (26158292), (26178971), (26212055), (26298231), (26359950), (26378060), (26399781), (26404510), (26431550), (26463117), (26507311), (26566676), (26598823), (26639036), (26655726), (26670233), (26679354), (26750735), (26764052), (26780446), (26879375), (26890602), (26952863), (27012089), (27036037), (27048303), (27048648), (27059126), (27071307), (27091134), (27097372), (27168224), (27179948), (27211557), (27235806), (27304501), (27330287), (27392857), (27440779), (27486771), (27501743), (27591812), (27617277), (27619662), (27694325), (27698205), (27733247), (27757122), (27789294), (27812983), (27825071), (27875990), (27897112), (27902456), (27922821), (27934653), (27959964), (27974395), (27980219), (28005429), (28012437), (28115977), (28122334), (28244876), (28254759), (28257663), (28260296), (28264931), (28301572), (28315697), (28322571), (28329151), (28371119), (28455969), (28540646), (28554316), (28603284), (28626026), (28639903), (28675698), (28694093), (28721811), (28732480), (28807816), (28831286), (28874954), (28911171), (28918902), (28929674), (28944926), (28953887), (28971552), (29027899), (29048631), (29074705), (29101804), (29157832), (29163135), (29165314), (29183728), (29316844), (29369521), (29388072), (29407795), (29408453), (29441009), (29461635), (29467291), (29473507), (29502958), (29515755), (29530582), (29570707), (29574227), (29579543), (29611102), (29726032), (29749694), (29752839), (29753771), (29804557), (29897294), (29921885), (29991711), (30036188), (30050560), (30057669), (30140974), (30153655), (30190613), (30197681), (30263780), (30359321), (30389500), (30393593), (30443855), (30510618), (30542441), (30619240), (30853664), (30902093), (31032688)) AS C1T(pmid_cited) ON (C1.pmid_cited = C1T.pmid_cited)
JOIN Citations C2
JOIN (VALUES (10660603), (15319361), (15356349), (16424025), (16627569), (16714284), (16847060), (16955484), (17081107), (17275731), (17534140), (18414039), (18443001), (18514625), (18555664), (18769112), (18787087), (18926585), (18971624), (19200882), (19245654), (19279323), (19380253), (19448702), (19478560), (19539012), (19602051), (19740975), (19805415), (19923900), (20047144), (20096035), (20139716), (20154608), (20157541), (20169165), (20363920), (20388102), (20437201), (20445122), (20519118), (20606252), (20676050), (20729871), (20739737), (20818934), (20886754), (20965424), (21115526), (21150328), (21157483), (21159787), (21179166), (21191146), (21212465), (21415462), (21428920), (21483039), (21483870), (21501117), (21520297), (21541762), (21555915), (21562229), (21572994), (21798089), (21840335), (21858089), (21917559), (21931802), (22115588), (22125056), (22246147), (22327552), (22354768), (22363791), (22388478), (22394614), (22408430), (22410287), (22468953), (22500797), (22546364), (22580468), (22672902), (22683661), (22817723), (22958933), (22960547), (22987149), (23006971), (23061800), (23239011), (23246968), (23255104), (23276696), (23325216), (23341224), (23363784), (23374718), (23399685), (23454756), (23454868), (23470275), (23517348), (23525940), (23555298), (23606170), (23625314), (23648089), (23686362), (23688930), (23702245), (23702336), (23734707), (23817674), (23850396), (23851366), (23884442), (23936371), (23982787), (24024901), (24178346), (24236459), (24296616), (24308993), (24324270), (24336084), (24350927), (24489988), (24496328), (24508508), (24518659), (24562770), (24589862), (24607448), (24677687), (24744983), (24774073), (24799956), (24821673), (24862022), (24866016), (24899720), (24915467), (24918639), (24981831), (25038772), (25040542), (25062253), (25088526), (25110610), (25239873), (25249372), (25258312), (25341517), (25348018), (25388238), (25449851), (25470422), (25476900), (25483712), (25491300), (25540326), (25553480), 
(25568097), (25587030), (25596147), (25655936), (25661995), (25686248), (25758051), (25776557), (25796566), (25807975), (25827254), (25902704), (25907074), (25926513), (26017155), (26051878), (26053964), (26059377), (26158292), (26178971), (26212055), (26298231), (26359950), (26378060), (26399781), (26404510), (26431550), (26463117), (26507311), (26566676), (26598823), (26639036), (26655726), (26670233), (26679354), (26750735), (26764052), (26780446), (26879375), (26890602), (26952863), (27012089), (27036037), (27048303), (27048648), (27059126), (27071307), (27091134), (27097372), (27168224), (27179948), (27211557), (27235806), (27304501), (27330287), (27392857), (27440779), (27486771), (27501743), (27591812), (27617277), (27619662), (27694325), (27698205), (27733247), (27757122), (27789294), (27812983), (27825071), (27875990), (27897112), (27902456), (27922821), (27934653), (27959964), (27974395), (27980219), (28005429), (28012437), (28115977), (28122334), (28244876), (28254759), (28257663), (28260296), (28264931), (28301572), (28315697), (28322571), (28329151), (28371119), (28455969), (28540646), (28554316), (28603284), (28626026), (28639903), (28675698), (28694093), (28721811), (28732480), (28807816), (28831286), (28874954), (28911171), (28918902), (28929674), (28944926), (28953887), (28971552), (29027899), (29048631), (29074705), (29101804), (29157832), (29163135), (29165314), (29183728), (29316844), (29369521), (29388072), (29407795), (29408453), (29441009), (29461635), (29467291), (29473507), (29502958), (29515755), (29530582), (29570707), (29574227), (29579543), (29611102), (29726032), (29749694), (29752839), (29753771), (29804557), (29897294), (29921885), (29991711), (30036188), (30050560), (30057669), (30140974), (30153655), (30190613), (30197681), (30263780), (30359321), (30389500), (30393593), (30443855), (30510618), (30542441), (30619240), (30853664), (30902093), (31032688)) AS C2T(pmid_cited) ON (C2.pmid_cited = C2T.pmid_cited)
ON C1.pmid_citing = C2.pmid_citing AND C1.pmid_cited < C2.pmid_cited
JOIN Publications P
ON C1.pmid_citing = P.pmid
LIMIT 100000;
For a relatively small number of papers, EXPLAIN ANALYZE produces the following
report:
Limit (cost=1716128.06..3443131.71 rows=3450 width=16) (actual time=79649.128..149931.732 rows=7930 loops=1)
-> Gather (cost=1716128.06..3443131.71 rows=3450 width=16) (actual time=79649.127..149938.334 rows=7930 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=1715128.06..3441786.71 rows=1438 width=16) (actual time=82233.777..149838.343 rows=2643 loops=3)
-> Parallel Hash Join (cost=1715127.49..3430160.88 rows=1438 width=16) (actual time=82232.856..149783.769 rows=2643 loops=3)
Hash Cond: (c1.pmid_citing = c2.pmid_citing)
Join Filter: (c1.pmid_cited < c2.pmid_cited)
Rows Removed by Join Filter: 6919
-> Hash Join (cost=8.12..1714918.72 rows=16052 width=8) (actual time=203.564..72203.375 rows=4275 loops=3)
Hash Cond: (c1.pmid_cited = "*VALUES*".column1)
-> Parallel Seq Scan on citations c1 (cost=0.00..1450882.60 rows=70364660 width=8) (actual time=0.034..58069.652 rows=56291727 loops=3)
-> Hash (cost=4.06..4.06 rows=325 width=4) (actual time=0.330..0.330 rows=325 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 20kB
-> Values Scan on "*VALUES*" (cost=0.00..4.06 rows=325 width=4) (actual time=0.014..0.211 rows=325 loops=3)
-> Parallel Hash (cost=1714918.72..1714918.72 rows=16052 width=8) (actual time=77565.158..77565.158 rows=4275 loops=3)
Buckets: 65536 Batches: 1 Memory Usage: 1056kB
-> Hash Join (cost=8.12..1714918.72 rows=16052 width=8) (actual time=668.721..77555.931 rows=4275 loops=3)
Hash Cond: (c2.pmid_cited = "*VALUES*_1".column1)
-> Parallel Seq Scan on citations c2 (cost=0.00..1450882.60 rows=70364660 width=8) (actual time=0.337..63168.461 rows=56291727 loops=3)
-> Hash (cost=4.06..4.06 rows=325 width=4) (actual time=2.525..2.525 rows=325 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 20kB
-> Values Scan on "*VALUES*_1" (cost=0.00..4.06 rows=325 width=4) (actual time=0.004..0.248 rows=325 loops=3)
-> Index Scan using publications_pkey on publications p (cost=0.56..8.08 rows=1 width=8) (actual time=0.018..0.018 rows=1 loops=7930)
Index Cond: (pmid = c1.pmid_citing)
Planning Time: 3.030 ms
Execution Time: 149941.944 ms
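The Parallel Seq Scan on citations c1 is quite slow
here because of the column order in the index on the citations
table: pmid_cited is not the leading column, so lookups by pmid_cited cannot use the index efficiently.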
List the existing indexes with the following query:
SELECT
tablename,
indexname,
indexdef
FROM
pg_indexes
WHERE
schemaname = 'public'
ORDER BY
tablename,
indexname;
The index is the following:
citations | citations_pmid_citing_pmid_cited_unique | CREATE UNIQUE INDEX citations_pmid_citing_pmid_cited_unique ON public.citations USING btree (pmid_citing, pmid_cited)
Swapping the column order in the index gives a huge performance boost when the number of papers is small.
After adding the index with the command:
CREATE UNIQUE INDEX citations_pmid_cited_citing_unique ON public.citations USING btree (pmid_cited, pmid_citing);
We get the following results:
Limit (cost=122146.00..272311.38 rows=3450 width=16) (actual time=410.287..504.142 rows=7930 loops=1)
-> Nested Loop (cost=122146.00..272311.38 rows=3450 width=16) (actual time=410.286..502.944 rows=7930 loops=1)
-> Hash Join (cost=122145.43..244419.09 rows=3450 width=16) (actual time=410.196..432.040 rows=7930 loops=1)
Hash Cond: (c1.pmid_citing = c2.pmid_citing)
Join Filter: (c1.pmid_cited < c2.pmid_cited)
Rows Removed by Join Filter: 20756
-> Nested Loop (cost=0.57..121663.29 rows=38526 width=8) (actual time=6.531..20.647 rows=12826 loops=1)
-> Values Scan on "*VALUES*" (cost=0.00..4.06 rows=325 width=4) (actual time=0.002..0.218 rows=325 loops=1)
-> Index Only Scan using citations_pmid_cited_citing_unique on citations c1 (cost=0.57..373.15 rows=119 width=8) (actual time=0.025..0.056 rows=39 loops=325)
Index Cond: (pmid_cited = "*VALUES*".column1)
Heap Fetches: 12826
-> Hash (cost=121663.29..121663.29 rows=38526 width=8) (actual time=402.974..402.975 rows=12826 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 1014kB
-> Nested Loop (cost=0.57..121663.29 rows=38526 width=8) (actual time=0.046..396.899 rows=12826 loops=1)
-> Values Scan on "*VALUES*_1" (cost=0.00..4.06 rows=325 width=4) (actual time=0.005..0.492 rows=325 loops=1)
-> Index Only Scan using citations_pmid_cited_citing_unique on citations c2 (cost=0.57..373.15 rows=119 width=8) (actual time=0.776..1.208 rows=39 loops=325)
Index Cond: (pmid_cited = "*VALUES*_1".column1)
Heap Fetches: 12826
-> Index Scan using publications_pkey on publications p (cost=0.56..8.08 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=7930)
Index Cond: (pmid = c1.pmid_citing)
Planning Time: 318.455 ms
Execution Time: 526.311 ms
Corresponding changes during DB creation are required.
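The effect of index column order can be reproduced on a small scale. The sketch below is not PostgreSQL: it uses Python's built-in sqlite3 module as a stand-in, on an empty in-memory table, only to illustrate that a B-tree index helps lookups on its leading column. The table and index names are taken from the report above; the helper name query_plan is made up for this example, and the exact plan wording varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as a list of strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# SQLite stand-in for the citations table (assumption: two integer columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (pmid_citing INTEGER, pmid_cited INTEGER)")
conn.execute(
    "CREATE UNIQUE INDEX citations_pmid_citing_pmid_cited_unique "
    "ON citations (pmid_citing, pmid_cited)"
)

# The slow query filters on pmid_cited, which is NOT the leading index column.
lookup = "SELECT pmid_citing FROM citations WHERE pmid_cited = ?"
plan_before = query_plan(conn, lookup, (23453633,))

# Add the index with the columns swapped, as proposed above.
conn.execute(
    "CREATE UNIQUE INDEX citations_pmid_cited_citing_unique "
    "ON citations (pmid_cited, pmid_citing)"
)
plan_after = query_plan(conn, lookup, (23453633,))

print("before:", plan_before)  # a full SCAN
print("after: ", plan_after)   # a SEARCH on the new index
```

With only the original index the planner has to scan every entry; once the swapped index exists it can seek directly to the requested pmid_cited, which mirrors the switch from Parallel Seq Scan to Index Only Scan in the two reports.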
22:51:09.043 [main] PubmedXMLParser INFO /var/folders/td/g2ws4hwj5tj48_j_tsfz8_tc0000gp/T/tmp6040019257560436124.tmp/pubmed19n0506.xml: Parsing...
22:51:14.054 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:14.526 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:14.792 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:15.166 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:15.424 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:15.814 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:16.202 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:16.599 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:16.965 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:17.327 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:17.645 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:18.020 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:18.440 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:18.855 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:19.259 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:19.681 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:20.033 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:20.436 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:20.760 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:21.109 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:21.447 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:21.715 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:21.925 [main] PostgresqlDatabaseHandler INFO Storing 1000 articles...
22:51:21.927 [main] PubmedCrawler INFO Deleting directory: /var/folders/td/g2ws4hwj5tj48_j_tsfz8_tc0000gp/T/tmp6040019257560436124.tmp
22:51:21.953 [main] PubmedCrawler INFO Writing stats to /Users/oleg/.pubtrends/stats.tsv
Exception in thread "main" java.lang.IllegalStateException: Value 'Final report of the amended safety assessment of Glyceryl Laurate, Glyceryl Laurate SE, Glyceryl Laurate/Oleate, Glyceryl Adipate, Glyceryl Alginate, Glyceryl Arachidate, Glyceryl Arachidonate, Glyceryl Behenate, Glyceryl Caprate, Glyceryl Caprylate, Glyceryl Caprylate/Caprate,
Glyceryl Citrate/Lactate/Linoleate/Oleate, Glyceryl Cocoate, Glyceryl Collagenate, Glyceryl Erucate, Glyceryl Hydrogenated Rosinate, Glyceryl Hydrogenated Soyate, Glyceryl Hydroxystearate, Glyceryl Isopalmitate, Glyceryl Isostearate, Glyceryl Isostearate/Myristate, Glyceryl Isostearates, Glyceryl Lanolate, Glyceryl Linoleate, Glyceryl Linolenate, Glyceryl Montanate, Glyceryl Myristate, Glyceryl Isotridecanoate/Stearate/Adipate, Glyceryl Oleate SE, Glyceryl Oleate/Elaidate, Glyceryl Palmitate, Glyceryl Palmitate/Stearate, Glyceryl Palmitoleate, Glyceryl Pentadecanoate, Glyceryl Polyacrylate, Glyceryl Rosinate, Glyceryl Sesquioleate, Glyceryl/Sorbitol Oleate/Hydroxystearate, Glyceryl Stearate/Acetate, Glyceryl Stearate/Maleate, Glyceryl Tallowate, Glyceryl Thiopropionate, and Glyceryl Undecylenate.' can't be stored to database column because exceeds length org.jetbrains.bio.pubtrends.crawler.Publications.title.columnType.colLength
at org.jetbrains.exposed.sql.statements.UpdateBuilder.set(UpdateBuilder.kt:24)
at org.jetbrains.exposed.sql.statements.BatchInsertStatement.set(BatchInsertStatement.kt:28)
at org.jetbrains.bio.pubtrends.crawler.PostgresqlDatabaseHandler$store$1$1.invoke(DatabaseHandler.kt:59)
at org.jetbrains.bio.pubtrends.crawler.PostgresqlDatabaseHandler$store$1$1.invoke(DatabaseHandler.kt:11)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandlerKt.batchInsertOnDuplicateKeyUpdate(DatabaseHandler.kt:117)
at org.jetbrains.bio.pubtrends.crawler.PostgresqlDatabaseHandler$store$1.invoke(DatabaseHandler.kt:55)
at org.jetbrains.bio.pubtrends.crawler.PostgresqlDatabaseHandler$store$1.invoke(DatabaseHandler.kt:11)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.inTopLevelTransaction(ThreadLocalTransactionManager.kt:103)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:74)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction$default(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.bio.pubtrends.crawler.PostgresqlDatabaseHandler.store(DatabaseHandler.kt:52)
at org.jetbrains.bio.pubtrends.crawler.PubmedXMLHandler.endElement(PubmedXMLHandler.kt:152)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLNSDTDValidator.endNamespaceScope(XMLNSDTDValidator.java:266)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(XMLDTDValidator.java:2005)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(XMLDTDValidator.java:879)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2967)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at org.jetbrains.bio.pubtrends.crawler.PubmedXMLParser.parse(PubmedXMLParser.kt:33)
at org.jetbrains.bio.pubtrends.crawler.PubmedCrawler.downloadFiles(PubmedCrawler.kt:130)
at org.jetbrains.bio.pubtrends.crawler.PubmedCrawler.update(PubmedCrawler.kt:67)
at org.jetbrains.bio.pubtrends.MainKt.main(Main.kt:96)
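One possible way to avoid this failure is to clip over-long titles before they reach the database. The following is only a minimal sketch of that idea, not the crawler's actual fix: the constant MAX_TITLE_LENGTH and the function clip_title are hypothetical, and the real limit is whatever length the Publications.title column was declared with.

```python
MAX_TITLE_LENGTH = 1023  # hypothetical value; use the real column length


def clip_title(title: str, max_length: int = MAX_TITLE_LENGTH) -> str:
    """Return the title unchanged if it fits, else cut it and append '...'."""
    if max_length < 3:
        raise ValueError("max_length must leave room for the ellipsis")
    if len(title) <= max_length:
        return title
    # Reserve three characters so the result is exactly max_length long.
    return title[: max_length - 3] + "..."


print(clip_title("Final report of the amended safety assessment", 20))
```

Widening the column (or switching it to an unbounded text type) would keep the full title instead; clipping trades completeness for a fixed schema.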
Papers that do not follow Bradford's law of citation counts can be somewhat important in the field. See: https://en.wikipedia.org/wiki/Bradford%27s_law
To reproduce:
Process all the files up to 0972, edit the config.properties
file, and run the following command:
./gradlew clean crawler:shadowJar && java -jar crawler/build/libs/crawler-dev.jar
19:04:51.864 [main] INFO Last downloaded file: pubmed19n0972.xml.gz
19:04:51.865 [main] INFO Created temporary directory: /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp1843002966264895509.tmp
19:04:56.614 [main] INFO Found 6 new file(s)
19:04:56.614 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp1843002966264895509.tmp/pubmed19n0973.xml.gz: Downloading...
19:05:04.082 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp1843002966264895509.tmp/pubmed19n0973.xml.gz: Unpacking...
19:05:07.037 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp1843002966264895509.tmp/pubmed19n0973.xml: Parsing...
19:05:29.347 [main] INFO Articles: 30000, keywords: 51392, citations: 355188
19:05:29.347 [main] INFO /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp1843002966264895509.tmp/pubmed19n0973.xml: Storing...
19:05:37.716 [main] INFO Deleting directory: /var/folders/yx/rkbldym139jdbtx4dsr_wb0c0000gp/T/tmp1843002966264895509.tmp
RETURNING * was aborted: ERROR: duplicate key value violates unique constraint "publications_pkey"
Detail: Key (pmid)=(1766) already exists. Call getNextException to see other errors in the batch.
at org.jetbrains.exposed.sql.statements.Statement.executeIn$exposed(Statement.kt:61)
at org.jetbrains.exposed.sql.Transaction.exec(Transaction.kt:128)
at org.jetbrains.exposed.sql.Transaction.exec(Transaction.kt:122)
at org.jetbrains.exposed.sql.statements.Statement.execute(Statement.kt:29)
at org.jetbrains.exposed.sql.QueriesKt.batchInsert(Queries.kt:90)
at org.jetbrains.exposed.sql.QueriesKt.batchInsert$default(Queries.kt:60)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler$store$1.invoke(DatabaseHandler.kt:43)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler$store$1.invoke(DatabaseHandler.kt:9)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.inTopLevelTransaction(ThreadLocalTransactionManager.kt:103)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:74)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction$default(ThreadLocalTransactionManager.kt:57)
at org.jetbrains.bio.pubtrends.crawler.DatabaseHandler.store(DatabaseHandler.kt:40)
at org.jetbrains.bio.pubtrends.crawler.PubmedCrawler.downloadFiles(PubmedCrawler.kt:113)
at org.jetbrains.bio.pubtrends.crawler.PubmedCrawler.update(PubmedCrawler.kt:54)
at org.jetbrains.bio.pubtrends.MainKt.main(Main.kt:7)
RETURNING * was aborted: ERROR: duplicate key value violates unique constraint "publications_pkey"
Detail: Key (pmid)=(1766) already exists. Call getNextException to see other errors in the batch.
at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:148)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2179)
at org.postgresql.core.v3.QueryExecutorImpl.flushIfDeadlockRisk(QueryExecutorImpl.java:1297)
at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1322)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:465)
at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:835)
at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1556)
at org.jetbrains.exposed.sql.statements.InsertStatement.execInsertFunction(InsertStatement.kt:86)
at org.jetbrains.exposed.sql.statements.InsertStatement.executeInternal(InsertStatement.kt:95)
at org.jetbrains.exposed.sql.statements.InsertStatement.executeInternal(InsertStatement.kt:12)
at org.jetbrains.exposed.sql.statements.Statement.executeIn$exposed(Statement.kt:59)
... 15 more
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "publications_pkey"
Detail: Key (pmid)=(1766) already exists.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2433)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2178)
... 24 more
oleg-laptop:pubtrends oleg$
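The duplicate-key failure above happens when an update file contains an article whose pmid is already stored. A common remedy is an upsert: insert the row, and on a key conflict overwrite the existing one. The sketch below illustrates the idea with Python's built-in sqlite3 module (SQLite 3.24 or newer) rather than PostgreSQL and Exposed; the ON CONFLICT clause has the same shape in PostgreSQL. The two-column table, the store function, and the titles are simplified assumptions; only pmid 1766 comes from the log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the publications table (assumption: two columns).
conn.execute("CREATE TABLE publications (pmid INTEGER PRIMARY KEY, title TEXT)")

# Insert a row, or update the title if the pmid already exists.
UPSERT = (
    "INSERT INTO publications (pmid, title) VALUES (?, ?) "
    "ON CONFLICT (pmid) DO UPDATE SET title = excluded.title"
)


def store(conn, articles):
    """Store (pmid, title) pairs in one transaction, overwriting duplicates."""
    with conn:
        conn.executemany(UPSERT, articles)


store(conn, [(1766, "original title")])
# A later update file repeats pmid 1766; a plain INSERT would fail here.
store(conn, [(1766, "revised title"), (1767, "another article")])

rows = conn.execute("SELECT pmid, title FROM publications ORDER BY pmid").fetchall()
print(rows)
```

After the second call the table holds the revised row for pmid 1766 and the new row for 1767, and no exception is raised.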
Please add the supported features, with some good visualization examples, to the README.md
file.
For some topics, e.g. human aging, there is a tremendous number of papers. We would like to omit the low-cited ones to keep the processing fast and the results visually interpretable.
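As a rough illustration of the idea, the sketch below keeps only the most cited papers from a mapping of paper id to citation count. The function name keep_most_cited, the limit parameter, and the sample counts are all hypothetical; the request itself does not say where the cut-off should be or how it should be chosen.

```python
import heapq


def keep_most_cited(citation_counts: dict, limit: int) -> list:
    """Return up to `limit` paper ids, ordered from most to least cited."""
    return heapq.nlargest(limit, citation_counts, key=citation_counts.get)


# Made-up citation counts keyed by made-up paper ids.
counts = {101: 5, 102: 100, 103: 0, 104: 42}
print(keep_most_cited(counts, 2))
```

A fixed limit keeps the downstream analysis bounded regardless of how large the topic is; a minimum citation threshold would be an alternative with different trade-offs.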
Example: id=420880, title starts with "[Changes in DNA methylation in rat during onto..."
I'm not sure whether we have the problem described here; we should check. https://www.cwts.nl/blog?article=n-r2u2a4
Pubmed XML has the following details.
<PublicationType UI="D016454">Review</PublicationType>
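A minimal sketch of how this element could be used to flag reviews, assuming the article XML is available as a string. The function is_review and the trimmed-down sample fragment are hypothetical (real PubMed records nest the element more deeply); the UI value D016454 is the one shown above.

```python
import xml.etree.ElementTree as ET

REVIEW_UI = "D016454"  # UI attribute of the "Review" publication type


def is_review(article_xml: str) -> bool:
    """Return True if any PublicationType element carries the Review UI."""
    root = ET.fromstring(article_xml)
    return any(
        element.get("UI") == REVIEW_UI
        for element in root.iter("PublicationType")
    )


# Simplified fragment for illustration only.
sample = """
<PubmedArticle>
  <PublicationTypeList>
    <PublicationType UI="D016428">Journal Article</PublicationType>
    <PublicationType UI="D016454">Review</PublicationType>
  </PublicationTypeList>
</PubmedArticle>
"""
print(is_review(sample))
```

Matching on the UI attribute rather than on the element text avoids depending on the exact label wording.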