View Issue Details

ID: 0012319
Project: mantisbt
Category: attachments
View Status: public
Last Update: 2019-08-25 16:57
Reporter: gthomas
Assigned To:
Priority: low
Severity: feature
Reproducibility: N/A
Status: new
Resolution: open
Product Version: 1.2.17
Summary: 0012319: Index attachments' content
Description

I'd like to have a plugin for full-text indexing of attachment content (.doc, .odt, .pdf).

Additional Information

Possibilities:
a) use a separate Apache Lucene instance with some (RESTful HTTP?) interface.
b) use Apache Tika parser with PostgreSQL tsearch2 full-text indexer.

a) seems like hard work and may be heavyweight (a Java servlet running on some servlet engine)
b) is waay easier - at least when you're already using PostgreSQL under your Mantis...
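
To make option b) more concrete, here is a rough sketch of what the PostgreSQL side might look like. It is an illustration only: the table `attachment_fts`, its columns, and the 'english' text search configuration are hypothetical and not part of the Mantis schema. It uses the full-text functions built into PostgreSQL since 8.3 (`to_tsvector`, `plainto_tsquery`, `ts_rank`), which superseded the tsearch2 contrib module, and `%s` placeholders in the style of psycopg2-like drivers. The text passed in is assumed to have been extracted by Tika already.

```python
# Sketch only: table and column names are hypothetical, not Mantis schema.
SCHEMA_SQL = """
CREATE TABLE attachment_fts (
    file_id  integer PRIMARY KEY,
    content  tsvector NOT NULL
);
CREATE INDEX attachment_fts_idx ON attachment_fts USING gin (content);
"""


def index_statement(file_id, text, config="english"):
    """Return (sql, params) that stores the extracted text of one attachment."""
    sql = ("INSERT INTO attachment_fts (file_id, content) "
           "VALUES (%s, to_tsvector(%s, %s))")
    return sql, (file_id, config, text)


def search_statement(words, config="english"):
    """Return (sql, params) that finds attachments matching all given words."""
    sql = ("SELECT file_id, ts_rank(content, q) AS rank "
           "FROM attachment_fts, plainto_tsquery(%s, %s) AS q "
           "WHERE content @@ q ORDER BY rank DESC")
    return sql, (config, words)
```

Each pair could be handed to a driver's `cursor.execute(sql, params)`; the matching `file_id` values would then have to be mapped back to issues.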

Tags: fts, plugin, postgresql
Attached Files

Activities

gthomas

2010-09-05 15:39

reporter   ~0026579

I need some suggestions: should uploaded files be indexed online or offline?
If offline, then should I call the "java -jar tika-app.jar" directly from PHP, or should that be run from some cron script?

Any other ideas?

dhx

2010-09-19 02:58

reporter   ~0026782

This is a big undertaking.

I think you'd ideally want to perform indexing on a cron job cycle at low IO/CPU priority (ionice + renice). If you call an indexing command every time a file is uploaded, you could end up with multiple CPU-intensive processes running at the same time on your server. With a cron job you have much better control over what times of day the CPU-intensive work is performed and how many CPUs are used concurrently.
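
A minimal sketch of that idea, in Python rather than the plugin's PHP, and only under assumptions: the function names are made up for illustration, and the commands are started with `ionice -c3` and `nice -n 19` instead of applying `renice` to an already running process. The `max_workers` limit caps how many indexing processes run at once.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def low_priority(cmd):
    """Prefix cmd so it starts in the idle I/O class with the lowest CPU priority."""
    return ["ionice", "-c3", "nice", "-n", "19"] + list(cmd)


def run_batch(commands, max_workers=2, runner=subprocess.call):
    """Run each command at low priority, at most max_workers at a time.

    Returns the exit codes in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: runner(low_priority(c)), commands))
```

A script built around `run_batch` would be started from cron at whatever hours the server is quiet. As written it depends on `ionice`, so it is Linux-only.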

Of course, this would make it Linux-only, which is a potential downside. That said, it is a plugin, so someone could create a Windows-specific version of it if they wanted to, or contribute patches later to add Windows support to the plugin you're proposing.

I'm a little concerned about how this will work given that we support many different database types. I guess you could just make a full-text search plugin specific to PostgreSQL, but then you'd be limiting the number of users who can use your plugin.

dhx

2010-09-19 02:59

reporter   ~0026783

Not to mention the multiple different ways in which attachments can be stored:

1) On a remote FTP server

2) As a file within the uploads/files directory

3) Within the database as big blobs

gthomas

2010-09-19 03:47

reporter   ~0026786

This is absolutely a WIP, but things work now:

  • extraction with antiword/unzip/pdftotext OR tika
  • indexing backend: PostgreSQL's TSearch2 OR Xapian
  • indexing in a cron job (uses file_api's file_get_content, so the storage method doesn't matter).

So indexing works, but the user-facing part (embedding it in the "View Issues" page) is still missing (hopefully next week), and configuration needs more work, too.

GThomas
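
The extraction step listed above could look roughly like the following sketch. It is not the plugin's code (the plugin is PHP); it is a Python illustration under assumptions: the function name and the `TIKA_JAR` location are hypothetical, and the attachment content is assumed to have been written to a local file first. The command lines use each tool's standard way of writing to stdout.

```python
import os

TIKA_JAR = "tika-app.jar"  # assumed location of the Tika application jar


def extract_command(path, use_tika=False):
    """Return the argv that prints the text of `path` to stdout, or None if unsupported."""
    if use_tika:
        # Tika detects the file type itself; --text asks for plain text output.
        return ["java", "-jar", TIKA_JAR, "--text", path]
    ext = os.path.splitext(path)[1].lower()
    if ext == ".doc":
        return ["antiword", path]
    if ext == ".pdf":
        # "-" makes pdftotext write to stdout instead of a .txt file.
        return ["pdftotext", path, "-"]
    if ext == ".odt":
        # Prints the raw content.xml; the XML tags still have to be stripped.
        return ["unzip", "-p", path, "content.xml"]
    return None
```

The returned command would be run with something like `subprocess.check_output`, and its output passed on to whichever indexing backend is configured.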

gthomas

2010-09-19 15:50

reporter   ~0026788

Search works now, but why do I need to set $g_plugin_current[0] = 'AttachmentIndexer' (the plugin's name) every time? (Not just from the cron job, but from IndexerFilter.class.php, too.)

gthomas

2010-09-20 05:28

reporter   ~0026796

Attached a working (at least with TSearch2) version, without tika-app-0.7.jar (17MB).

gthomas

2010-09-25 07:49

reporter   ~0026857

Since mantisforge doesn't accept my pushes, I uploaded it to
https://tgulacsi@github.com/tgulacsi/mantis-attachmentindexer.git