[question] Which HTTP request method is used for file crawling? #20

@sho-suzuki

plugin version

1.3.1

gitbucket version

4.20

what is the problem

Behind a proxy, the crawler cannot get content from files, but it can get issues and wikis.
The relevant part of fess-crawler.log is as follows:

# file crawling log
2018-02-13 18:12:32,511 [5DFNjmEBO7Desvq7XhyO-1] INFO  Get a content from http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge
2018-02-13 18:12:35,028 [5DFNjmEBO7Desvq7XhyO-1] WARN  Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e
org.codelibs.fess.crawler.exception.CrawlingAccessException: Failed to parse http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true
        at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:184) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeFileContent(GitBucketDataStoreImpl.java:291) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.lambda$storeData$4713(GitBucketDataStoreImpl.java:134) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:441) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeData(GitBucketDataStoreImpl.java:124) [classes/:?]
        at org.codelibs.fess.ds.impl.AbstractDataStoreImpl.store(AbstractDataStoreImpl.java:106) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.process(DataIndexHelper.java:236) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.run(DataIndexHelper.java:222) [classes/:?]
Caused by: org.codelibs.fess.crawler.exception.MultipleCrawlingAccessException: 
Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true;
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true
        at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:95) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:148) ~[classes/:?]
        ... 9 more

# issue crawl log
2018-02-13 18:43:02,794 [5DFNjmEBO7Desvq7XhyO-1] INFO  Get a content from http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/issues/17

From a Linux shell, curl returns the same response for both requests:

# file request
curl http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/README.md
{"message":"Requires authentication"}
# issue request
curl http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/issues/21
{"message":"Requires authentication"}
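
The curl checks above use shorter URLs than the crawler does. A minimal sketch for repeating the comparison with the URL shape seen in the log (including the `ref` and `large_file` parameters) could look like the following. It assumes the crawler sends a plain GET, which the log wording suggests but does not confirm. The helper names `contents_url` and `probe`, the proxy address, and the `Authorization: token ...` header are illustrative assumptions, not taken from the plugin.

```python
import urllib.error
import urllib.parse
import urllib.request


def contents_url(base, owner, repo, path, ref=None, large_file=False):
    """Build a contents-API URL shaped like the ones in the crawler log."""
    quoted_path = urllib.parse.quote(path)  # "/" is left unescaped by default
    url = f"{base.rstrip('/')}/api/v3/repos/{owner}/{repo}/contents/{quoted_path}"
    params = {}
    if ref:
        params["ref"] = ref
    if large_file:
        params["large_file"] = "true"
    return url + ("?" + urllib.parse.urlencode(params) if params else "")


def probe(url, proxy=None, token=None, timeout=10):
    """Send a GET (assumed method) and return the HTTP status or the error text."""
    # An empty mapping disables proxies entirely; otherwise route http via `proxy`.
    proxy_handler = urllib.request.ProxyHandler({"http": proxy} if proxy else {})
    opener = urllib.request.build_opener(proxy_handler)
    request = urllib.request.Request(url, method="GET")
    if token:
        # Assumed header format for an access token.
        request.add_header("Authorization", f"token {token}")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, e.g. 401
    except (urllib.error.URLError, OSError) as err:
        return f"connection failed: {err}"
```

Calling `probe(contents_url("http://gitbucket:8080/gitbucket", "name", "repo", "hoge", ref="<commit id>", large_file=True), proxy="http://proxy.example:3128")` once with the proxy and once with `proxy=None`, and doing the same for the issue URL, would show whether only the file request fails at the connection level.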

I suspect a problem with the proxy settings (the proxy may be discarding the file requests).
I would like to know which HTTP request the file crawl API uses.

Thanks.
